---
quiz: ml_systems_quizzes.json
concepts: ml_systems_concepts.yml
glossary: ml_systems_glossary.json
engine: jupyter
---

# ML Systems {#sec-ml-systems}

```{python}
#| echo: false
#| label: chapter-start

```

::: {layout-narrow}
::: {.column-margin}
\chapterminitoc
:::

\noindent
{fig-alt="Split-brain illustration with the left hemisphere showing circuit board patterns and processors on a white background, and the right hemisphere displaying a colorful neural network with various AI application icons and data connections on a blue background."}

:::

## Purpose {.unnumbered}

\begin{marginfigure}
\mlsysstack{35}{20}{30}{30}{30}{30}{25}{15}
\end{marginfigure}

_Why does deploying the same model to a phone versus a datacenter demand fundamentally different engineering?_

The defining insight of ML systems engineering is that constraints drive architecture. The speed of light sets an absolute floor on how quickly distant servers can respond. Thermodynamics limits how much computation can occur in a given volume before heat becomes unmanageable. Memory physics makes moving data often more expensive than processing it. These are not engineering limitations awaiting better technology; they are permanent physical boundaries that partition the world into fundamentally distinct operating regimes. A datacenter can train billion-parameter models but cannot guarantee low-latency responses to users thousands of miles away. A smartphone can respond instantly but has a fraction of the memory budget. A microcontroller can run on a coin-cell battery for years but has barely enough compute for a simple keyword detector. The same model—the same algorithm applied to the same data—demands radically different engineering in each regime, not because of design preferences but because different physics governs each environment. Teams that treat deployment as an afterthought—training a model in the cloud and then asking "how do we ship this?"—discover too late that the physics of their target environment invalidates months of architectural decisions. Understanding these regimes transforms deployment from an operational detail into a first-order engineering decision: the question is never simply "how do I make this model work?" but rather "which physical constraints govern my problem, and how do they shape what is even possible?"

::: {.content-visible when-format="pdf"}
\newpage
:::

::: {.callout-tip title="Learning Objectives"}

- Explain how physical constraints (speed of light, **Power Wall**, **Memory Wall**) necessitate the deployment spectrum from cloud to TinyML.
- Apply the **Iron Law** and **Bottleneck Principle** to determine whether a workload is compute-bound, memory-bound, or I/O-bound.
- Map workload archetypes to deployment paradigms using **Lighthouse Model** examples.
- Distinguish the four **deployment paradigms** (Cloud, Edge, Mobile, TinyML) by their operational characteristics and quantitative trade-offs.
- Apply the **decision framework** to select deployment paradigms based on privacy, latency, computational, and cost requirements.
- Analyze hybrid integration patterns to determine which combinations address specific system constraints.
- Evaluate deployment decisions by identifying common fallacies (including Amdahl's Law limits on system speedup) and assessing alignment between architecture and requirements.
- Identify the universal principles (data pipelines, resource management, system architecture) that apply across deployment paradigms and explain why optimization techniques transfer between scales.

:::

```{python}
#| label: ml-systems-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ DEPLOYMENT OVERVIEW: LATENCY AND POWER RANGES
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-deployment-paradigms-overview (~30 lines below),
# │          also reused in @tbl-deployment-thresholds and
# │          @sec-ml-systems-mobile-ml-benefits-resource-constraints-c568.
# │
# │ Goal: Provide latency and power ranges for the four deployment paradigms.
# │ Show: Quantitative latency and power trade-offs in the overview table.
# │ How:  Read latency ranges and TDP range from centralized constants.
# │
# │ Imports: mlsysim.core.constants (CLOUD_LATENCY_RANGE_MS, EDGE_LATENCY_RANGE_MS,
# │          MOBILE_LATENCY_RANGE_MS, TINY_LATENCY_RANGE_MS, MOBILE_TDP_RANGE_W)
# │ Exports: MLSystemsSetup.cloud_latency_range_str,
# │          MLSystemsSetup.edge_latency_range_str,
# │          MLSystemsSetup.mobile_latency_range_str,
# │          MLSystemsSetup.tiny_latency_range_str,
# │          MLSystemsSetup.mobile_tdp_range_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import (
    CLOUD_LATENCY_RANGE_MS, EDGE_LATENCY_RANGE_MS,
    MOBILE_LATENCY_RANGE_MS, TINY_LATENCY_RANGE_MS,
    MOBILE_TDP_RANGE_W
)

# ┌── LEGO ───────────────────────────────────────────────
class MLSystemsSetup:
    """
    Namespace for deployment paradigm overview ranges.
    Scenario: Latency and power ranges for the four paradigms.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    cloud_latency_range = CLOUD_LATENCY_RANGE_MS
    edge_latency_range = EDGE_LATENCY_RANGE_MS
    mobile_latency_range = MOBILE_LATENCY_RANGE_MS
    tiny_latency_range = TINY_LATENCY_RANGE_MS

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # (pass-through: ranges are pre-formatted strings from constants)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────
    cloud_latency_range_str = cloud_latency_range
    edge_latency_range_str = edge_latency_range
    mobile_latency_range_str = mobile_latency_range
    tiny_latency_range_str = tiny_latency_range
    mobile_tdp_range_str = MOBILE_TDP_RANGE_W
```

## Deployment Paradigm Framework {#sec-ml-systems-deployment-paradigm-framework-0d25}

\index{physical constraints!deployment implications}Where an ML model runs shapes what is possible in ways no algorithmic choice can override. Yet deployment is far harder than it appears, and the reason is not the model itself. In production ML systems, the model accounts for roughly 5% of the codebase [@sculley2015hidden]. The remaining 95% consists of data collection, feature processing, serving infrastructure, monitoring, and resource management. All of this surrounding infrastructure changes dramatically depending on where the model executes.

Consider two extremes: a wake-word detector on a smartwatch and a recommendation engine in a data center. The wake-word detector represents a **TinyML** workload operating under milliwatt power budgets and kilobyte memory limits; the recommendation engine exemplifies a **Cloud ML** workload requiring terabytes of embedding tables and megawatt-scale infrastructure. These systems solve different problems under opposite physical constraints, and the infrastructure that supports them shares almost nothing in common. This reality transforms deployment from an operational afterthought into a first-order engineering decision, one that the AI Triad from @sec-introduction helps us reason about by foregrounding infrastructure alongside data and algorithms.

What makes these systems so different? The physical constraints that govern each environment—latency, power, and memory—force ML deployment into four distinct paradigms, each with its own engineering trade-offs and system design patterns. **Cloud ML**\index{Cloud ML!characteristics} aggregates computational resources in data centers, offering virtually unlimited compute and storage at the cost of network latency. **Edge ML**\index{Edge ML!latency benefits} moves computation closer to where data originates—factory floors, retail stores, hospitals—achieving lower latency and keeping sensitive data on-premises. **Mobile ML**\index{Mobile ML!energy constraints} brings intelligence directly to smartphones and tablets, balancing computational capability against battery life and thermal constraints. **TinyML**\index{TinyML!always-on sensing} pushes intelligence to the smallest devices—microcontrollers costing dollars and consuming milliwatts—enabling always-on sensing that runs for months on a coin-cell battery. These four paradigms span nine orders of magnitude in power consumption (megawatts to milliwatts) and memory capacity (terabytes to kilobytes), a range so vast that the engineering principles governing one end of the spectrum barely apply at the other.

These four paradigms exist not because of engineering choices but because of physical laws that no amount of optimization can overcome. Three fundamental constraints—the speed of light (establishing latency floors), thermodynamic limits on power dissipation (capping computation per watt), and the energy cost of memory signaling (creating the Memory Wall)—carve the deployment landscape into distinct operating regimes. These are not design preferences but physical boundaries: you cannot serve a self-driving car from a data center 36 ms away, and you cannot train a 1.5-billion-parameter model on a microcontroller.

## The Architectural Anchor: The Single-Node Stack {#sec-ml-systems-architectural-anchor}

\index{Single-Node Stack!architecture layers}To navigate these operating regimes, we anchor our engineering decisions in a four-layer architectural model of the **Single-Node Stack**. This model provides the foundational framework for analyzing any ML system before it is projected onto a larger distributed fleet. Understanding how these layers interact within a single machine is the technical prerequisite for mastering larger scales.

1. **Application (The Mission)**: The top layer where high-level requirements—throughput for training loops or latency for inference serving—are defined. This is where the "Dual Mandate" of accuracy and physics is managed (@sec-model-training, @sec-model-serving).
2. **ML Framework (The Compiler)**: The translation layer (PyTorch, JAX) that maps high-level math to hardware-specific execution plans. It manages the computational graph, automatic differentiation, and memory scheduling (@sec-ml-frameworks).
3. **Operating System (The Runtime)**: The interface between framework and hardware, responsible for the low-level orchestration of resources. This includes the **CUDA Runtime** for kernel management and **PCIe DMA** (Direct Memory Access) for efficient data movement between host and device.
4. **Hardware (The Silicon)**: The physical foundation where bits are transformed. This layer is defined by HBM (High Bandwidth Memory) capacity and high-speed intra-node interconnects like **NVLink (900 GB/s)**. Here, the **Memory Wall** acts as the primary physical constraint (@sec-hardware-acceleration).

Every chapter in the first half of this text interrogates one or more of these layers. Mastery of this single-node regime establishes the "Silicon Contract" that governs all subsequent optimization and scaling efforts.

These physical constraints interact with the **Iron Law of ML Systems** (@sec-introduction-iron-law-ml-systems-c32a), which decomposes end-to-end latency into data movement, computation, and overhead.
Different deployment environments stress different terms of this equation: cloud systems are typically compute-bound, mobile systems hit power walls, and TinyML devices are memory-capacity-limited. By pairing the physical constraints with the Iron Law, we develop a quantitative vocabulary for reasoning about *which* paradigm fits a given workload and *why*. To anchor this analysis concretely, the chapter introduces five **Lighthouse Models**—ResNet-50, GPT-2, DLRM, MobileNet, and a Keyword Spotter—that span the deployment spectrum and isolate distinct system bottlenecks. These reference workloads recur throughout the book, providing a consistent basis for comparing optimization techniques across chapters.
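
The Iron Law decomposition can be sketched in a few lines of code. The sketch below is purely illustrative: the function names and the latency numbers are hypothetical placeholders, not measurements and not part of the book's `mlsysim` package.

```python
# Illustrative sketch of the Iron Law: end-to-end latency decomposes into
# data movement, computation, and overhead. All numbers are hypothetical.

def iron_law_latency_ms(data_movement_ms, compute_ms, overhead_ms):
    """Total latency is the sum of the three Iron Law terms."""
    return data_movement_ms + compute_ms + overhead_ms

def dominant_term(data_movement_ms, compute_ms, overhead_ms):
    """Name the term that dominates the latency budget."""
    terms = {"data movement": data_movement_ms,
             "computation": compute_ms,
             "overhead": overhead_ms}
    return max(terms, key=terms.get)

# A cloud inference call: compute is fast, but the network round trip dominates.
cloud = dict(data_movement_ms=80.0, compute_ms=5.0, overhead_ms=10.0)
# A TinyML keyword spotter: no network at all, so on-device compute dominates.
tiny = dict(data_movement_ms=0.1, compute_ms=40.0, overhead_ms=1.0)

assert dominant_term(**cloud) == "data movement"
assert dominant_term(**tiny) == "computation"
```

The same total budget can be spent in entirely different places, which is why optimizing the wrong term yields no end-to-end improvement.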

The physics that creates these paradigm boundaries comes first, followed by the analytical tools (Iron Law, Bottleneck Principle, Workload Archetypes) for mapping workloads to deployment targets. Each paradigm then receives an in-depth treatment covering infrastructure, trade-offs, and representative applications. The chapter closes with a comparative decision framework and the hybrid architectures that combine paradigms when no single deployment target satisfies all requirements.

These four paradigms function as distinct operating envelopes, each defined by how much power, memory, and network connectivity is available. Every ML application must fit within at least one of these envelopes, and that fit determines which algorithms, hardware, and engineering trade-offs apply. The four paradigms span a continuous spectrum from centralized cloud infrastructure to distributed ultra-low-power devices. @fig-cloud-edge-TinyML-comparison traces this spectrum visually, mapping where each paradigm sits along the centralization axis, while @tbl-deployment-paradigms-overview pins down the quantitative trade-offs.

::: {#fig-cloud-edge-TinyML-comparison fig-env="figure" fig-pos="t" fig-cap="**Distributed Intelligence Spectrum**: Machine learning deployment spans from centralized cloud infrastructure to resource-constrained TinyML devices, each balancing processing location, device capability, and network dependence. Source: [@abiresearch2024tinyml]." fig-alt="Horizontal spectrum showing 5 deployment tiers from left to right: ultra-low-power devices and sensors, intelligent device, gateway, on-premise servers, and cloud. Arrows indicate TinyML, Edge AI, and Cloud AI spans across the spectrum."}
```{.tikz}
\begin{tikzpicture}[line cap=round,line join=round,font=\usefont{T1}{phv}{m}{n}\small]
% Parameters
\def\angle{10}       % angle
\def\length{18}      % Lengths (cm)
\def\npoints{5}      % number of points
\def\startfrac{0.13} % start (e.g., 0.2 = 20%)
\def\endfrac{0.87}   % end (e.g., 0.8 = 80%)

\draw[line width=1pt, black!70] (0,0) -- ({\length*cos(\angle)}, {\length*sin(\angle)})coordinate(end);
%
\foreach \i in {0,1,...,\numexpr\npoints-1} {
  \pgfmathsetmacro{\t}{\startfrac + (\endfrac - \startfrac)*\i/(\npoints-1)}
  \coordinate(T\i)at({\t*\length*cos(\angle)}, {\t*\length*sin(\angle)});
}

\tikzset {
  pics/gatewey/.style = {
    code = {
      \colorlet{red}{white}
      \begin{scope}[local bounding box=GAT,scale=0.9, every node/.append style={transform shape}]
        \def\rI{4mm}
        \def\rII{2.8mm}
        \def\rIII{1.6mm}
        \draw[red,line width=1.25pt](0,0)--(0,0.38)--(1.2,0.38)--(1.2,0)--cycle;
        \draw[red,line width=1.5pt](0.6,0.4)--(0.6,0.9);

        \draw[red, line width=1.5pt] (0.6,0.9)+(60:\rI) arc[start angle=60, end angle=-60, radius=\rI];
        \draw[red, line width=1.5pt] (0.6,0.9)+(50:\rII) arc[start angle=50, end angle=-50, radius=\rII];
        \draw[red, line width=1.5pt] (0.6,0.9)+(30:\rIII) arc[start angle=30, end angle=-30, radius=\rIII];
        %
        \draw[red, line width=1.5pt] (0.6,0.9)+(120:\rI) arc[start angle=120, end angle=240, radius=\rI];
        \draw[red, line width=1.5pt] (0.6,0.9)+(130:\rII) arc[start angle=130, end angle=230, radius=\rII];
        \draw[red, line width=1.5pt] (0.6,0.9)+(150:\rIII) arc[start angle=150, end angle=210, radius=\rIII];
        \fill[red](0.6,0.9)circle (1.5pt);

        \foreach\i in{0.15,0.3,0.45,0.6}{
          \fill[red](\i,0.19)circle (1.5pt);
        }

        \fill[red](1,0.19)circle (2pt);
      \end{scope}
}}}

\tikzset {
  pics/cloud/.style = {
    code = {
      \colorlet{red}{white}
      \begin{scope}[local bounding box=CLO,scale=0.6, every node/.append style={transform shape}]
        \draw[red,line width=1.5pt](0,0)to[out=170,in=180,distance=11](0.1,0.61)
          to[out=90,in=105,distance=17](1.07,0.71)
          to[out=20,in=75,distance=7](1.48,0.36)
          to[out=350,in=0,distance=7](1.48,0)--(0,0);
        \draw[red,line width=1.5pt](0.27,0.71)to[bend left=25](0.49,0.96);
        \draw[red,line width=1.5pt](0.67,1.21)to[out=55,in=90,distance=13](1.5,0.96)
          to[out=360,in=30,distance=9](1.68,0.42);
      \end{scope}
}}}

\tikzset {
  pics/server/.style = {
    code = {
      \colorlet{red}{white}
      \begin{scope}[anchor=center, transform shape,scale=0.8, every node/.append style={transform shape}]
        \draw[red,line width=1.25pt,fill=white](-0.55,-0.5) rectangle (0.55,0.5);
        \foreach \i in {-0.25,0,0.25} {
          \draw[BlueLine,line width=1.25pt]( -0.55,\i) -- (0.55, \i);
        }
        \foreach \i in {-0.375, -0.125, 0.125, 0.375} {
          \draw[BlueLine,line width=1.25pt](-0.45,\i)--(0,\i);
          \fill[BlueLine](0.35,\i) circle (1.5pt);
        }

        \draw[red,line width=1.75pt](0,-0.53) |- (-0.55,-0.7);
        \draw[red,line width=1.75pt](0,-0.53) |- (0.55,-0.7);
      \end{scope}
    }
  }
}

\tikzset {
  pics/cpu/.style = {
    code = {
      \definecolor{CPU}{RGB}{0,120,176}
      \colorlet{CPU}{white}
      \begin{scope}[local bounding box = CPU,scale=0.33, every node/.append style={transform shape}]
        \node[fill=CPU,minimum width=66, minimum height=66,
              rounded corners=2,outer sep=2pt] (C1) {};
        \node[fill=violet,minimum width=54, minimum height=54] (C2) {};
        %\node[fill=CPU!40,minimum width=44, minimum height=44] (C3) {CPU};

        \foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
          \node[fill=CPU,minimum width=4, minimum height=15,
                inner sep=0pt,anchor=south](GO\y)at($(C1.north west)!\x!(C1.north east)$){};
        }
        \foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
          \node[fill=CPU,minimum width=4, minimum height=15,
                inner sep=0pt,anchor=north](DO\y)at($(C1.south west)!\x!(C1.south east)$){};
        }
        \foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
          \node[fill=CPU,minimum width=15, minimum height=4,
                inner sep=0pt,anchor=east](LE\y)at($(C1.north west)!\x!(C1.south west)$){};
        }
        \foreach \x/\y in {0.11/1,0.26/2,0.41/3,0.56/4,0.71/5,0.85/6}{
          \node[fill=CPU,minimum width=15, minimum height=4,
                inner sep=0pt,anchor=west](DE\y)at($(C1.north east)!\x!(C1.south east)$){};
        }
      \end{scope}
} }}

\tikzset {
  pics/mobile/.style = {
    code = {
      \colorlet{red}{white}
      \begin{scope}[local bounding box=MOB,scale=0.4, every node/.append style={transform shape}]
        \node[rectangle,draw=red,minimum height=94,minimum width=47,
              rounded corners=6,thick,fill=white](R1){};
        \node[rectangle,draw=red,minimum height=67,minimum width=38,thick,fill=GreenFill](R2){};
        \node[circle,minimum size=8,below= 2pt of R2,inner sep=0pt,thick,fill=GreenFill]{};
        \node[rectangle,fill=GreenFill,minimum height=2,minimum width=20,above= 4pt of R2,inner sep=0pt,thick]{};
        %
      \end{scope}
} }}

\node[draw=none,fill=RedFill,circle,minimum size=20mm](GA)at(T2){};
\pic[shift={(-0.55,-0.5)}] at (T2) {gatewey};
\node[above=0 of GA]{Gateway};
\node[draw=none,fill=VioletL,circle,minimum size=20mm](CP)at(T0){};
\pic[shift={(0,-0)}] at (T0) {cpu};
\node[above=0 of CP,align=center]{Ultra Low Powered\\Devices and Sensors};
\node[draw=none,fill=GreenFill,circle,minimum size=20mm](MO)at(T1){};
\pic[shift={(0,0)}] at (T1) {mobile};
\node[above=0 of MO,align=center]{Intelligent\\Device};
\node[draw=none,fill=BlueFill,circle,minimum size=20mm](SE)at(T3){};
\pic[shift={(-0.03,0.1)}] at (T3) {server};
\node[above=0 of SE,align=center]{On Premise\\Servers};
\node[draw=none,fill=BrownL,circle,minimum size=20mm](CL)at(T4){};
\pic[shift={(-0.48,-0.35)}] at (T4) {cloud};
\node[above=0 of CL,align=center]{Cloud};
%
\path (T0) -- (T1) coordinate[pos=0.5] (M1);
\path (0,0) -- (T0) coordinate[pos=0.25] (M0);
\path (T3) -- (T4) coordinate[pos=0.5] (M2);
\path (T4) -- (end) coordinate[pos=0.75] (M3);

\foreach \x in {0,1,2,3}{
  \fill[OliveLine](M\x)circle (2.5pt);
}

\path[red](M0)--++(270:1.6)coordinate(LL1)-|coordinate(LL2)(M2);
\path[red](M0)--++(270:1.1)coordinate(L1)-|coordinate(L2)(M1);
\path[red](M0)--++(270:1.1)-|coordinate(L3)(M2);
\path[red](M0)--++(270:1.1)-|coordinate(L4)(M3);
%
\draw[black!70,thick](M0)--(LL1);
\draw[black!70,thick](M1)--(L2);
\draw[black!70,thick](M3)--(L4);
\draw[black!70,thick](M2)--(LL2);
\draw[latex-latex,line width=1pt,draw=black!60](L1)--node[red,fill=white]{TinyML}(L2);
\draw[latex-latex,line width=1pt,draw=black!60](L3)--node[fill=white]{Cloud AI}(L4);
\draw[latex-latex,line width=1pt,draw=black!60]([yshift=4pt]LL1)--node[fill=white,text=black]{Edge AI}([yshift=4pt]LL2);
\foreach \x in {0,1,2,3}{
  \fill[OliveLine](M\x)circle (2.5pt);
}
%
\path[](M0)--++(90:4.2)-|node[pos=0.25]{\textbf{The Distributed Intelligence Spectrum}}(M3);
\end{tikzpicture}

```
:::

@tbl-deployment-paradigms-overview compares the quantitative trade-offs across these four paradigms:

| **Paradigm** | **Where** | **Latency** | **Power** | **Memory** | **Best For** |
|:--------------|:-----------------|------------------------------------------------------:|:-------------------------------------------------|:-----------|:-----------------------------|
| **Cloud ML** | Data centers | `{python} MLSystemsSetup.cloud_latency_range_str` ms | MW | TB | Training, complex inference |
| **Edge ML** | Local servers | `{python} MLSystemsSetup.edge_latency_range_str` ms | 100s W | GB | Real-time inference, privacy |
| **Mobile ML** | Smartphones | `{python} MLSystemsSetup.mobile_latency_range_str` ms | `{python} MLSystemsSetup.mobile_tdp_range_str` W | GB | Personal AI, offline |
| **TinyML** | Microcontrollers | `{python} MLSystemsSetup.tiny_latency_range_str` ms | mW | KB | Always-on sensing |

: **The Deployment Spectrum (Conceptual)**: Four paradigms span nine orders of magnitude in power (MW to mW) and memory (TB to KB). This conceptual overview defines each paradigm by its operating regime; @tbl-representative-systems later grounds these categories in specific hardware platforms and quantitative decision thresholds. The hardware specifications and physical constants underpinning these numbers are catalogued in the System Assumptions appendix. {#tbl-deployment-paradigms-overview}
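
Read as operating envelopes, the rows above can act as a first-pass feasibility filter. The sketch below is illustrative only: the bounds are rounded order-of-magnitude assumptions standing in for the book's centralized constants, and `feasible_paradigms` is a hypothetical helper, not part of `mlsysim`.

```python
# First-pass paradigm filter based on rough operating envelopes.
# Bounds are order-of-magnitude assumptions, not the book's constants.
PARADIGM_ENVELOPES = {
    # paradigm: (network latency floor in ms, rough model-size ceiling in bytes)
    "Cloud ML":  (50.0, 1e12),  # WAN round trip; TB-scale memory
    "Edge ML":   (2.0,  1e10),  # one LAN hop; tens of GB
    "Mobile ML": (0.0,  1e9),   # on-device; ~GB
    "TinyML":    (0.0,  1e5),   # on-device; ~100 KB of SRAM
}

def feasible_paradigms(latency_budget_ms, model_bytes):
    """Return paradigms whose latency floor and memory ceiling both fit."""
    return [name for name, (floor_ms, ceiling_b) in PARADIGM_ENVELOPES.items()
            if latency_budget_ms > floor_ms and model_bytes <= ceiling_b]

# A 10 MB model that must respond within 8 ms cannot use a distant cloud:
print(feasible_paradigms(latency_budget_ms=8.0, model_bytes=10e6))
# -> ['Edge ML', 'Mobile ML']
```

Note how the filter eliminates candidates for opposite reasons: the cloud fails on latency, while TinyML fails on memory.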

The nine-order-of-magnitude span in @tbl-deployment-paradigms-overview is not an accident of engineering history—it is a consequence of physics. No amount of optimization can make a datacenter respond faster than light can travel, or make a microcontroller dissipate more heat than its surface area allows. The question "why do these four paradigms exist, rather than a single universal solution?" has a precise answer rooted in three physical laws.

## Physical Constraints: Why Paradigms Exist {#sec-ml-systems-deployment-spectrum-71be}

\index{Silicon Contract!physical constraints} \index{physical constraints!speed of light} \index{physical constraints!thermodynamics} \index{physical constraints!memory signaling}The speed of light, the thermodynamics of power dissipation, and the energy cost of memory signaling together dictate that no single "ideal" computer exists. Where a system runs reshapes the contract between model and hardware. These three constraints—which we call the *Light Barrier*, *Power Wall*, and *Memory Wall*—govern the engineering trade-offs ahead.[^fn-paradigm-deployment]

### The Light Barrier {.unnumbered}

\index{Light Barrier!latency floor}The Light Barrier establishes the absolute latency[^fn-latency-systems] floor. The minimum round-trip time is governed by @eq-latency-physics:

$$\text{Latency}_{\min} = \frac{2 \times \text{Distance}}{c_{\text{fiber}}} \approx \frac{2 \times \text{Distance}}{200{,}000 \text{ km/s}}$$ {#eq-latency-physics}

```{python}
#| label: light-barrier-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LIGHT BARRIER LATENCY CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Light Barrier" narrative (Physical Constraints section)
# │
# │ Goal: Demonstrate the physical latency floor of cloud computing.
# │ Show: Cross-country round-trip time exceeds tight real-time budgets.
# │ How:  Calculate RTT for CA-to-VA fiber transmission using SPEED_OF_LIGHT.
# │
# │ Imports: mlsysim.core.constants, mlsysim.fmt
# │ Exports: LightLatency.min_latency_str, LightLatency.distance_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import SPEED_OF_LIGHT_FIBER_KM_S, ureg
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class LightLatency:
    """
    Namespace for Light-Speed Latency calculation.
    Scenario: Cross-country packet transmission (CA to VA) vs 10ms budget.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    distance_km = 3600 * ureg.km  # California to Virginia (straight-line)
    safety_budget_ms = 10

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Step 1: Latency = (Distance * 2) / Speed of Light (Round-trip time)
    min_latency = (distance_km * 2) / SPEED_OF_LIGHT_FIBER_KM_S
    min_latency_ms = min_latency.m_as(ureg.ms)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(min_latency_ms > safety_budget_ms,
          f"Physics allows cloud ({min_latency_ms:.1f}ms) within {safety_budget_ms}ms budget!")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────
    min_latency_str = fmt(min_latency_ms, precision=0, commas=False)
    distance_str = f"{distance_km.m_as('km'):,}"
```

California to Virginia (~`{python} LightLatency.distance_str` km straight-line) requires **~`{python} LightLatency.min_latency_str` ms minimum** before any computation begins. Actual cloud services typically add 60–150 ms of software overhead. Applications requiring sub-10 ms response *cannot* use distant cloud infrastructure—physics forbids it. This constraint creates the need for **Edge ML** and **TinyML**: when latency budgets are tight, computation must move closer to the data source.
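
A quick way to internalize @eq-latency-physics is to evaluate the floor at a few distances. This is a self-contained sketch, independent of the book's `mlsysim` constants; the distances are approximate assumptions chosen for illustration.

```python
# Lower bound on round-trip time over optical fiber (@eq-latency-physics).
C_FIBER_KM_S = 200_000  # approximate speed of light in fiber, km/s

def min_round_trip_ms(distance_km):
    """Physical floor on round-trip latency in milliseconds."""
    return 2 * distance_km / C_FIBER_KM_S * 1000

for name, km in [("same-metro edge site", 50),
                 ("regional datacenter", 1_000),
                 ("cross-country (CA to VA)", 3_600)]:
    print(f"{name:>26}: {min_round_trip_ms(km):5.1f} ms floor")
```

At 50 km the floor is a negligible 0.5 ms; at cross-country scale the floor alone exceeds a 10 ms interactive budget severalfold, before any computation or software overhead.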

### The Power Wall {.unnumbered}

\index{Power Wall!thermal limits}\index{Power Wall!frequency scaling}
\index{Dennard scaling!breakdown}
The Power Wall emerged because thermodynamics limits how much computation can occur in a given volume. Under classical Dennard scaling[^fn-dennard-scaling-origin] (which held until approximately 2006), the relationship between power and frequency was cubic. In @eq-power-scaling, $C$ is effective capacitance, $V$ is voltage, and $f$ is clock frequency; because voltage tracks frequency ($V \propto f$), power rises as $f^3$:

$$\text{Power} \propto C \times V^2 \times f \quad \text{where } V \propto f \implies \text{Power} \propto f^3$$ {#eq-power-scaling}

```{python}
#| label: power-wall-throttling
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ POWER WALL: MOBILE THERMAL THROTTLING SCENARIO
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Power Wall prose immediately below — illustrates throttling
# │          behavior on battery-powered devices.
# │
# │ Goal: Provide concrete FPS numbers showing thermal throttling on mobile.
# │ Show: A mobile model drops from 60 FPS to 15 FPS after 1 minute.
# │ How:  Simple illustrative constants (no computation needed).
# │
# │ Imports: (none)
# │ Exports: ThrottlingScenario.fps_start_str,
# │          ThrottlingScenario.fps_throttled_str,
# │          ThrottlingScenario.duration_min_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── LEGO ───────────────────────────────────────────────
class ThrottlingScenario:
    """Namespace for illustrative mobile thermal throttling."""

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    fps_start = 60
    fps_throttled = 15
    duration_min = 1

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # (illustrative scenario — no derived quantities)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────
    fps_start_str = f"{fps_start}"
    fps_throttled_str = f"{fps_throttled}"
    duration_min_str = f"{duration_min}"
```

Doubling clock frequency required approximately 8$\times$ more power. The breakdown of this scaling relationship ended the era of "free" speedups via frequency scaling and forced the industry toward the parallelism (multi-core) and specialization (GPUs, TPUs) that defines modern ML. Mobile devices hit hard thermal limits at `{python} MLSystemsSetup.mobile_tdp_range_str` W; exceeding this causes "throttling," where the device reduces performance to prevent overheating. In practice, this means a mobile model that runs at `{python} ThrottlingScenario.fps_start_str` FPS for `{python} ThrottlingScenario.duration_min_str` minute may throttle to `{python} ThrottlingScenario.fps_throttled_str` FPS as the device heats up. This physical limit gives rise to **Mobile ML**: battery-powered devices cannot simply run cloud-scale models locally.
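
The 8$\times$ figure follows directly from the cubic relationship in @eq-power-scaling. A minimal sketch (the helper name is illustrative, not part of `mlsysim`):

```python
# Dennard-era scaling: with V tracking f, P ∝ C·V²·f grows as f³.
def relative_power(freq_ratio):
    """Power ratio implied by @eq-power-scaling when voltage tracks frequency."""
    return freq_ratio ** 3

assert relative_power(2.0) == 8.0  # double the clock: ~8x the power
print(f"1.5x clock -> {relative_power(1.5):.2f}x power")  # even modest bumps cost ~3.4x
```

This is why a few degrees of extra headroom cannot buy back a doubled clock: the power bill compounds cubically while battery and surface area stay fixed.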

[^fn-dennard-scaling-origin]: **Dennard Scaling**: Named after Robert Dennard (IBM, 1974), who showed that as transistors shrink, voltage and current scale proportionally, keeping power density constant. This held for three decades, delivering "free" performance gains each chip generation. When leakage current made further voltage reduction impossible around the 90 nm node (2005--2006), power density began rising with each generation---ending single-core frequency scaling and forcing the industry toward the parallelism and specialization (multi-core, GPU, TPU) that now defines ML hardware. \index{Dennard Scaling!origin and breakdown}

```{python}
#| label: memory-wall-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MEMORY WALL CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Memory Wall" narrative (Physical Constraints section)
# │
# │ Goal: Quantify the widening gap between compute and bandwidth.
# │ Show: The 1.33× annual divergence that defines modern systems engineering.
# │ How:  Compare historical growth rates for TFLOPS and memory bandwidth.
# │
# │ Imports: mlsysim.fmt (fmt, check)
# │ Exports: MemoryWall.compute_growth_str, MemoryWall.mem_bw_growth_str,
# │          MemoryWall.mem_wall_ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class MemoryWall:
    """
    Namespace for the Memory Wall calculation.
    Scenario: Comparing annual growth rates of Compute vs Memory Bandwidth.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    compute_growth_annual = 1.6  # 60% increase/year
    mem_bw_growth_annual = 1.2   # 20% increase/year

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    divergence_ratio = compute_growth_annual / mem_bw_growth_annual

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(divergence_ratio > 1.0, "Memory is keeping up with Compute (Gap <= 1x).")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────
    compute_growth_str = fmt(compute_growth_annual, precision=1, commas=False)
    mem_bw_growth_str = fmt(mem_bw_growth_annual, precision=1, commas=False)
    mem_wall_ratio_str = fmt(divergence_ratio, precision=2, commas=False)
```

### The Memory Wall {.unnumbered}
|
||
|
||
\index{Memory Wall!bandwidth divergence}\index{Memory Wall!compute-memory gap} The Memory Wall [@wulf1995memory] reflects the widening bandwidth[^fn-bandwidth-memory-wall] gap:
|
||
|
||
$$\frac{\text{Compute Growth}}{\text{Memory BW Growth}} \approx \frac{1.6\times\text{/year}}{1.2\times\text{/year}} \approx 1.33\times\text{/year}$$ {#eq-memory-wall}
|
||
|
||
```{python}
|
||
#| label: memory-wall-trends
|
||
#| echo: false
|
||
# ┌─────────────────────────────────────────────────────────────────────────────
|
||
# │ MEMORY WALL: COMPUTE VS BANDWIDTH GROWTH RATES
|
||
# ├─────────────────────────────────────────────────────────────────────────────
|
||
# │ Context: Prose immediately below @eq-memory-wall, quantifying the
|
||
# │ compute-memory divergence.
|
||
# │
|
||
# │ Goal: Provide the two growth-rate numbers cited in Memory Wall narrative.
|
||
# │ Show: Compute doubles every 18 months; memory BW grows ~20% annually.
|
||
# │ How: Canonical industry trend constants formatted for prose.
|
||
# │
|
||
# │ Imports: mlsysim.book (fmt)
|
||
# │ Exports: MemoryWallTrends.compute_doubling_months_str,
|
||
# │ MemoryWallTrends.mem_bw_growth_pct_str
|
||
# └─────────────────────────────────────────────────────────────────────────────
|
||
from mlsysim.fmt import fmt
|
||
|
||
# ┌── LEGO ───────────────────────────────────────────────
|
||
class MemoryWallTrends:
|
||
"""Namespace for compute vs memory bandwidth growth rates."""
|
||
|
||
# ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
|
||
compute_doubling_months = 18
|
||
mem_bw_growth_pct = 20
|
||
|
||
# ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
|
||
# (pass-through: canonical trend values)
|
||
|
||
# ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
|
||
|
||
# ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
|
||
compute_doubling_months_str = fmt(compute_doubling_months, precision=0)
|
||
mem_bw_growth_pct_str = fmt(mem_bw_growth_pct, precision=0)
|
||
```
|
||
|
||

\index{data movement!energy dominance}
@eq-memory-wall quantifies this divergence: processors have doubled in compute capacity roughly every `{python} MemoryWallTrends.compute_doubling_months_str` months, but memory bandwidth has improved only ~`{python} MemoryWallTrends.mem_bw_growth_pct_str`% annually. This widening gap makes data movement the dominant bottleneck and energy cost for most ML workloads. This constraint affects all paradigms but is especially acute for **TinyML**, where devices have only kilobytes of memory to work with. We examine the hardware architectural responses to the Memory Wall, including HBM and on-chip SRAM hierarchies, in detail in @sec-hardware-acceleration.
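
Compounded over time, this annual divergence is dramatic. A quick sketch, using the same stylized trend constants as above (industry averages, not measurements of any specific chip):

```python
def memory_wall_gap(years: int,
                    compute_growth: float = 1.6,  # ~60%/year compute trend
                    bw_growth: float = 1.2) -> float:  # ~20%/year bandwidth trend
    """Factor by which compute throughput outpaces memory bandwidth
    after `years` of compounding at the canonical trend rates."""
    return (compute_growth / bw_growth) ** years

# One year: ~1.33x. One decade: the gap compounds to roughly 18x,
# which is why a balanced design in 2015 is memory-starved today.
decade_gap = memory_wall_gap(10)
```

The exponent, not the base, is the story: a modest 1.33$\times$ annual mismatch becomes an order-of-magnitude imbalance within a hardware generation or two.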

::: {.callout-checkpoint title="Physical Constraints and Deployment"}
Deployment choices are governed by physics, not just preference. Check your understanding:

- [ ] **Light Barrier**: Can you explain why the speed of light makes Cloud ML impossible for <10 ms safety tasks?
- [ ] **Power Wall**: Do you understand why thermodynamics (heat dissipation) prevents datacenter models from running on mobile devices?
- [ ] **Memory Wall**: Can you explain why data movement is often more expensive (in time and energy) than computation?
:::

These physical laws explain *why* the four paradigms exist. Physics creates the boundaries; privacy regulation, economic incentives, and data sovereignty requirements reinforce and sharpen them. We examine these additional drivers within each paradigm section, but the central insight is that the paradigms would exist even without those concerns. No regulation can make the speed of light faster, and no economic model can repeal thermodynamics.

Knowing *that* these barriers exist is necessary but not sufficient. Given a specific ML workload—say, a recommendation engine or a wake-word detector—we need to determine *which* paradigm fits and *which* barrier the workload will hit first. The answer requires analytical tools that connect workload characteristics to these physical constraints: the Iron Law to decompose latency, the Bottleneck Principle to identify the dominant constraint, and a set of workload archetypes to classify where each model falls on the spectrum.

## Analyzing Workloads {#sec-ml-systems-analyzing-workloads-cbb8}

\index{Silicon Contract!Iron Law}The central analytical tool for this chapter is the **Iron Law of ML Systems**, established in @sec-introduction (@sec-introduction-iron-law-ml-systems-c32a) and restated here as @eq-iron-law:

$$T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak} \cdot \eta} + L_{lat}$$ {#eq-iron-law}

This equation decomposes total latency into three terms: data movement ($D_{vol}/BW$), compute ($O / (R_{peak} \cdot \eta)$), and fixed overhead ($L_{lat}$). For a single inference, these costs simply add up—you pay each one sequentially. In production systems, however, tasks are processed continuously as a stream, and the question shifts from "*how* long does one task take?" to "*which* of these three terms actually limits the system?" The answer depends entirely on the deployment environment: a model that is compute-bound during training may become memory-bound during inference; a system that runs efficiently in the cloud may hit power limits on mobile devices. To determine which term dominates, we need a companion principle.

### The Bottleneck Principle {#sec-ml-systems-bottleneck-principle-3514}

\index{bottleneck principle!pipelined execution} \index{system bottlenecks!identifying} \index{compute-bound vs memory-bound!definition}
\index{pipelined execution!throughput analysis}
The Iron Law tells us the cost of each term. The **Bottleneck Principle** tells us which term *matters*. Unlike traditional software, where optimizing the average case pays off, ML systems are dominated by their slowest component: optimizing fast operations yields zero benefit while the slowest stage remains unchanged. Modern accelerators use **pipelined execution** to overlap data movement with computation: while the accelerator computes on batch $n$, the memory system prefetches batch $n+1$. With this overlap, the system's throughput is determined by whichever operation is slower—the faster one "hides" behind it. The Iron Law's sum becomes a maximum, as @eq-bottleneck formalizes:

$$ T_{bottleneck} = \max\left(\frac{D_{vol}}{BW}, \frac{O}{R_{peak} \cdot \eta}, T_{network}\right) + L_{lat} $$ {#eq-bottleneck}

* **$\frac{D_{vol}}{BW}$ (Memory)**: Time to move data between memory and processor.
* **$\frac{O}{R_{peak} \cdot \eta}$ (Compute)**: Time to execute calculations.
* **$T_{network}$**: Time for network communication (if offloading).
* **$L_{lat}$ (Overhead)**: Fixed latency (kernel launch, runtime overhead).

This principle dictates that if your system is **Memory Bound**\index{memory-bound workloads!optimization strategy}\index{compute-bound vs memory-bound!memory-bound} ($D_{vol}/BW > O/(R_{peak} \cdot \eta)$), buying faster processors ($R_{peak}$) yields exactly **0% speedup**—just as widening a six-lane highway yields no benefit when all traffic must funnel through a two-lane bridge. You must identify the dominant term before optimizing. This trade-off is governed by *the energy of transmission*.
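
The dominant-term test can be written down directly. A minimal sketch; the hardware numbers below are hypothetical, chosen only to illustrate a memory-bound case where extra FLOPS buy nothing:

```python
def dominant_term(d_vol_bytes: float, bw_bytes_s: float,
                  ops: float, r_peak_ops_s: float, eta: float,
                  t_network_s: float = 0.0) -> tuple[str, float]:
    """Identify which Iron Law term limits a pipelined workload (max form)."""
    terms = {
        "memory": d_vol_bytes / bw_bytes_s,       # D_vol / BW
        "compute": ops / (r_peak_ops_s * eta),    # O / (R_peak * eta)
        "network": t_network_s,                   # T_network
    }
    name = max(terms, key=terms.get)
    return name, terms[name]

# Hypothetical accelerator: 100 GB/s memory, 10 TFLOP/s peak at 40% efficiency.
# Moving 400 MB takes 4 ms; 8 GFLOPs takes 2 ms -> memory bound, so a
# faster processor would yield 0% speedup on this workload.
which, seconds = dominant_term(400e6, 100e9, 8e9, 10e12, 0.4)
```

Only after this classification does an optimization strategy follow: a "memory" verdict calls for quantization or wider memory, a "compute" verdict for more (or better-utilized) arithmetic units.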

```{python}
#| label: energy-transmission-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ENERGY OF TRANSMISSION CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Energy of Transmission" (Bottleneck Principle section)
# │
# │ Goal: Compare energy costs of cloud offload vs. local compute.
# │ Show: The 1000× energy premium of transmitting raw data over local inference.
# │ How: Calculate Joules for data transfer vs. NPU-based local inference.
# │
# │ Imports: mlsysim.fmt (fmt, check)
# │ Exports: EnergyTransmission.data_mb_str, EnergyTransmission.tx_energy_str,
# │          EnergyTransmission.compute_energy_str,
# │          EnergyTransmission.cloud_total_str,
# │          EnergyTransmission.local_total_str, EnergyTransmission.ratio_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class EnergyTransmission:
    """
    Namespace for Energy of Transmission vs Compute.
    Scenario: Cost of sending 1MB to cloud vs running MobileNet locally.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    data_size_mb = 1.0  # 1 sec audio
    tx_energy_per_mb = 100.0  # mJ/MB (Wi-Fi/LTE)
    local_energy_op = 0.1  # mJ/inference (MobileNet on NPU)

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    cloud_energy_total = data_size_mb * tx_energy_per_mb
    local_energy_total = local_energy_op

    ratio = cloud_energy_total / local_energy_total

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(ratio >= 500, f"Transmission ({cloud_energy_total}mJ) is not expensive enough vs Compute ({local_energy_total}mJ). Ratio: {ratio:.1f}x")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    data_mb_str = fmt(data_size_mb, precision=0, commas=False)
    tx_energy_str = fmt(tx_energy_per_mb, precision=0, commas=False)
    compute_energy_str = fmt(local_energy_op, precision=1, commas=False)
    cloud_total_str = fmt(cloud_energy_total, precision=0, commas=False)
    local_total_str = fmt(local_energy_op, precision=1, commas=False)
    ratio_str = fmt(ratio, precision=0, commas=True)
```

::: {.callout-notebook title="The Energy of Transmission"}

\index{energy of transmission!local vs cloud} \index{Energy Wall!battery constraints}**Problem**: Should a battery-powered sensor process data locally (TinyML) or send it to the cloud?

**The Variables**:

* **Data ($D_{vol}$)**: `{python} EnergyTransmission.data_mb_str` MB (e.g., 1 second of audio).
* **Transmission Energy ($E_{tx}$)**: `{python} EnergyTransmission.tx_energy_str` mJ/MB (Wi-Fi/LTE).
* **Compute Energy ($E_{op}$)**: `{python} EnergyTransmission.compute_energy_str` mJ/inference (MobileNet on NPU).

**The Calculation**:

1. **Cloud Approach**: $E_{cloud} \approx D_{vol} \times E_{tx}$ = `{python} EnergyTransmission.data_mb_str` MB $\times$ `{python} EnergyTransmission.tx_energy_str` mJ/MB = **`{python} EnergyTransmission.cloud_total_str` mJ**.
2. **Local Approach**: $E_{local} \approx$ Inference = **`{python} EnergyTransmission.local_total_str` mJ**.

**The Systems Conclusion**: Transmitting raw data is **`{python} EnergyTransmission.ratio_str`$\times$ more expensive** than processing it locally. Even if the cloud had infinite speed ($Time \approx 0$), the **Energy Wall** makes cloud offloading physically impossible for always-on battery devices. The "Machine" constraint (Battery) dictates the "Algorithm" choice (TinyML).
:::
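
The break-even point implied by these numbers can be computed directly. A sketch using the same illustrative constants as the callout (not measured values for any particular radio or NPU):

```python
# Illustrative constants from the callout above (assumed, not measured).
TX_MJ_PER_MB = 100.0  # radio energy per MB transmitted (Wi-Fi/LTE)
LOCAL_MJ = 0.1        # energy per local NPU inference (MobileNet-class)

def breakeven_payload_mb(tx_mj_per_mb: float = TX_MJ_PER_MB,
                         local_mj: float = LOCAL_MJ) -> float:
    """Payload size below which offloading costs less energy than
    computing locally: E_tx * D = E_local  =>  D = E_local / E_tx."""
    return local_mj / tx_mj_per_mb

# Offload wins only below ~0.001 MB (about 1 KB) -- far smaller than any
# raw sensor payload, so always-on devices compute locally and transmit
# only the (tiny) result when something interesting happens.
breakeven = breakeven_payload_mb()
```

A useful corollary: transmitting the *answer* (a few bytes) instead of the *data* (megabytes) is almost always the energy-optimal design for battery-powered sensing.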

The **Iron Law's** variables interact differently across deployment scenarios. Before examining specific workload archetypes, verify your understanding of these core performance determinants.

::: {.callout-definition title="The Iron Law"}

***The Iron Law***\index{Iron Law!definition} is the fundamental physical constraint governing all machine learning performance, expressed as the total time $T$ required for a workload:
$$T = \frac{D_{vol}}{BW} + \frac{O}{R_{peak} \cdot \eta} + L_{lat}$$

1. **Significance (Quantitative):** It defines the **Physical Ceiling** for any system by quantifying the relationship between data volume ($D_{vol}$), compute capacity ($R_{peak}$), and communication overhead ($L_{lat}$).
2. **Distinction (Durable):** Unlike **Amdahl's Law**, which focuses on **Parallel Speedup**, the Iron Law addresses the **Total Energy and Time** required to move and transform data.
3. **Common Pitfall:** A frequent misconception is that these terms are independent. In reality, they are **Trade-off Axes**: for example, increasing batch size may improve hardware utilization ($\eta$) but also increase the data volume ($D_{vol}$) per request, potentially shifting a compute-bound problem to a memory-bound one.

:::

The Iron Law quantifies the *cost of each ingredient*; the Bottleneck Principle identifies the *speed of the assembly line*. As a rule of thumb, use the **additive form** (@eq-iron-law) when analyzing the **latency** of a single task, and the **max form** (@eq-bottleneck) when analyzing the **throughput** of a continuous stream of tasks.
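
This rule of thumb can be stated in a few lines of code. A sketch with hypothetical numbers; the variable names follow the two equations:

```python
def latency_additive(d_vol, bw, ops, r_peak, eta, l_lat):
    """Additive form: single-task latency -- every cost is paid in sequence."""
    return d_vol / bw + ops / (r_peak * eta) + l_lat

def pipelined_period(d_vol, bw, ops, r_peak, eta, l_lat, t_network=0.0):
    """Max form: steady-state time per task in a pipeline -- the faster
    stage hides behind the slower one."""
    return max(d_vol / bw, ops / (r_peak * eta), t_network) + l_lat

# Same hypothetical workload under both lenses (400 MB moved, 8 GFLOPs,
# 100 GB/s, 10 TFLOP/s at 40% efficiency, 1 ms fixed overhead):
# first-task latency is 7 ms, but the pipeline sustains one task every 5 ms.
lat = latency_additive(400e6, 100e9, 8e9, 10e12, 0.4, 0.001)
period = pipelined_period(400e6, 100e9, 8e9, 10e12, 0.4, 0.001)
```

With overlap, throughput is set by the slower stage alone, so the pipelined period is never longer than the single-task latency.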

### Workload Archetypes {#sec-ml-systems-workload-archetypes-fd10}

\index{D·A·M taxonomy!workload classification}
The Bottleneck Principle raises an immediate question: for a given workload, which constraint dominates? The answer depends on the **D·A·M taxonomy** from @sec-introduction, which decomposes every ML system into **Data**, **Algorithm**, and **Machine**. Different deployment environments create different bottlenecks along these axes—a cloud server with terabytes of memory faces Algorithm constraints, while a microcontroller with kilobytes faces Machine constraints.

To navigate these constraints systematically, we categorize ML workloads into four **Archetypes**\index{Workload Archetypes}[^fn-archetype-bottleneck]. These represent the primary physical bottlenecks, not just specific model architectures. We introduce each archetype briefly here; the Lighthouse Models that follow will ground each category in concrete, recurring examples.

The first archetype, the **Compute Beast**\index{arithmetic intensity!high intensity workloads}, describes workloads that perform many calculations per byte of data loaded. The binding constraint is raw computational throughput. Training large neural networks falls into this category.

The second archetype, the **Bandwidth Hog**\index{autoregressive generation!memory-bound}, describes workloads that spend more time loading data than computing. Memory bandwidth becomes the binding constraint. Autoregressive text generation (like ChatGPT producing one token at a time) falls into this category.

The third archetype, the **Sparse Scatter**\index{embedding tables!memory capacity bound}, describes workloads with irregular memory access patterns and poor cache locality. Memory capacity and access latency constrain performance. Recommendation systems with massive embedding tables are canonical examples.

The fourth archetype, the **Tiny Constraint**\index{energy per inference!binding constraint}\index{always-on sensing!power constraints}, describes workloads operating under extreme power envelopes ($< 1$ mW) and memory limits ($< 256$ KB). The binding constraint is energy per inference—efficiency, not raw speed. Always-on sensing operates in this regime.

These archetypes map naturally to deployment paradigms: Compute Beasts and Sparse Scatter workloads gravitate toward **Cloud ML** where resources are abundant. Bandwidth Hogs span Cloud and Edge depending on latency requirements. Tiny Constraint workloads are exclusively **TinyML** territory. To make these abstractions concrete, we anchor each archetype to a specific model that recurs throughout this book as one of *five reference workloads*.

\index{archetype!workload classification}

[^fn-archetype-bottleneck]: **Workload Archetype**: A classification of ML workloads by their dominant Iron Law bottleneck rather than their model family. The distinction matters because the optimization strategy differs fundamentally: a compute-bound workload benefits from faster arithmetic ($R_{peak}$), while a bandwidth-bound workload benefits only from wider memory buses ($BW$). Misidentifying the archetype wastes optimization effort on the wrong term of the Iron Law, as when teams add accelerator FLOPS to a memory-bound inference pipeline and observe zero speedup. \index{Archetype!workload classification}

\index{paradigm!deployment regimes}

[^fn-paradigm-deployment]: **Deployment Paradigm**: A distinct operating regime whose boundaries are set by physics, not convention. The Cloud-to-TinyML spectrum spans nine orders of magnitude in power because thermodynamic and electromagnetic constraints create hard walls that no software optimization can cross, forcing qualitatively different system architectures at each tier. Misidentifying the paradigm boundary wastes engineering effort: optimizing a cloud model for 5% higher throughput is pointless if the application's 10 ms latency budget demands edge deployment. \index{Paradigm!deployment regimes}

\index{latency!responsiveness constraint}

[^fn-latency-systems]: **Latency**: The time between issuing a request and receiving a result, corresponding to $L_{lat}$ in the Iron Law. The Light Barrier makes this floor irreducible: the speed of light in fiber imposes a ~36 ms minimum round trip across the continental US, consuming the entire latency budget of a 10 ms safety-critical system before any computation begins. Every millisecond consumed by distance is a millisecond unavailable for model inference, which is why the Light Barrier forces paradigm selection rather than mere optimization. \index{Latency!deployment constraint}

\index{bandwidth!memory wall}

[^fn-bandwidth-memory-wall]: **Memory Bandwidth (The Memory Wall)**: The term "Memory Wall" was coined by Wulf and McKee in 1995, who predicted that the processor-memory performance gap would eventually dominate system performance---a prediction that proved prescient for ML workloads where weight loading, not arithmetic, is the typical bottleneck. In the Iron Law, bandwidth ($BW$) appears in the denominator of the data term $D_{vol}/BW$, so every doubling of model size that is not matched by a doubling of memory bandwidth directly increases wall-clock time. This asymmetry, growing at roughly 1.33$\times$ per year, is why modern ML systems are more often memory-bound than compute-bound. \index{Bandwidth!memory wall}

\index{critical path!latency determination}

[^fn-critical-path-ml]: **Critical Path**: The longest sequential chain of dependent operations in a pipeline. The decision rule in the triggering sentence is strict: if a 200 ms cross-region network call appears anywhere on the critical path, a system with a 100 ms total budget is guaranteed to fail regardless of how fast every other stage runs. In practice, ML inference is rarely the longest stage; data preprocessing and postprocessing often dominate, making the critical path longer than the model execution time alone suggests. \index{Critical Path!optimization}

```{python}
#| label: lighthouse-setup
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ LIGHTHOUSE MODEL SPECIFICATIONS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @sec-ml-systems-workload-archetypes-fd10 — "Five Reference Workloads"
# │          callout; also @sec-ml-systems-system-balance-hardware-96ab
# │          (ResNet-50 bottleneck example, @tbl-representative-systems).
# │
# │ Goal: Provide specs for the five Lighthouse Models (ResNet-50, GPT-2/Llama,
# │       DLRM, MobileNetV2, KWS DS-CNN).
# │ Show: Parameter count, FLOP profile, and memory footprint that anchor
# │       each workload archetype to a concrete system.
# │ How: Retrieve parameters and FLOPs from Models twin; derive sizes using
# │      size_in_bytes() with 4-byte (FP32) precision.
# │
# │ Imports: mlsysim (Models), mlsysim.core.constants, mlsysim.fmt (fmt, check)
# │ Exports: LighthouseModels.resnet_gflops_str, .resnet_params_m_str,
# │          .resnet_fp32_mb_str, .gpt2_params_b_str, .llama_range_str,
# │          .dlrm_embedding_str, .mobilenet_flops_reduction_str,
# │          .mobile_tdp_range_str, .kws_params_str, .kws_size_kb_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Models
from mlsysim.core.constants import (
    RESNET50_FLOPs, GFLOPs, Mparam, Bparam, Kparam, byte, MB, GB, KB
)
from mlsysim.fmt import fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class LighthouseModels:
    """
    Namespace for Lighthouse Models statistics.
    Scenario: Quantifying the 5 reference workloads.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    m_resnet = Models.ResNet50
    m_gpt2 = Models.GPT2
    m_llama = Models.Language.Llama2_70B
    m_dlrm = Models.DLRM
    m_mobilenet = Models.MobileNetV2
    m_kws = Models.Tiny.DS_CNN

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    resnet_flops_g = RESNET50_FLOPs.m_as(GFLOPs)
    resnet_params_m = m_resnet.parameters.m_as(Mparam)
    resnet_fp32_mb = m_resnet.size_in_bytes(4 * byte).m_as(MB)

    gpt2_params_b = m_gpt2.parameters.m_as(Bparam)

    # DLRM: embedding table size
    dlrm_embedding_gb = m_dlrm.model_size.m_as(GB)

    # MobileNet: ResNet-50 ~4.1 GFLOPs vs MobileNetV2 ~300 MFLOPs
    mobilenet_flops_reduction = 4100 / 300

    # KWS: keyword-spotting footprint
    kws_params = m_kws.parameters.m_as(Kparam)
    kws_size_kb = m_kws.size_in_bytes(4 * byte).m_as(KB)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(resnet_fp32_mb >= 90, f"ResNet50 size should be ~98MB, got {resnet_fp32_mb:.0f}MB")
    check(mobilenet_flops_reduction > 10, "MobileNet reduction should be >10x vs ResNet.")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    resnet_gflops_str = fmt(resnet_flops_g, precision=1)
    resnet_params_m_str = fmt(resnet_params_m, precision=1)
    resnet_fp32_mb_str = fmt(resnet_fp32_mb, precision=0)

    gpt2_params_b_str = fmt(gpt2_params_b, precision=1)
    llama_range_str = "7 to 70"  # Llama family range

    dlrm_embedding_str = fmt(dlrm_embedding_gb, precision=0)

    mobilenet_flops_reduction_str = fmt(mobilenet_flops_reduction, precision=0)
    mobile_tdp_range_str = "2 to 5"  # Standard mobile envelope

    kws_params_str = f"{int(kws_params)}K"
    kws_size_kb_str = fmt(kws_size_kb, precision=0)
```

::: {.callout-lighthouse title="Five Reference Workloads"}

Throughout this book, we use five Lighthouse Models introduced in @sec-introduction—concrete workloads that span the deployment spectrum and isolate distinct system bottlenecks. @sec-network-architectures provides full architectural details and model biographies.

| **Lighthouse** | **Archetype** | **Deployment Paradigm** |
|:---------------------------|:--------------------------|:-------------------------------|
| **ResNet-50** | Compute Beast | Cloud training, edge inference |
| **GPT-2 / Llama** | Bandwidth Hog | Cloud inference |
| **DLRM** | Sparse Scatter | Cloud only (distributed) |
| **MobileNet** | Compute Beast (efficient) | Mobile, edge |
| **Keyword Spotting (KWS)** | Tiny Constraint | TinyML, always-on |

:::

To ground the abstract interdependencies of the Iron Law in concrete practice, we analyze the Lighthouse Models introduced in @sec-introduction. The following summaries recap each workload from a systems perspective, connecting them to the specific Iron Law bottlenecks they exemplify.

The first lighthouse, **ResNet-50**\index{ResNet-50!systems characteristics}, classifies images into 1,000 categories, processing each image through approximately `{python} LighthouseModels.resnet_gflops_str` billion floating-point operations using `{python} LighthouseModels.resnet_params_m_str` million parameters (`{python} LighthouseModels.resnet_fp32_mb_str` MB at FP32). Used in medical imaging diagnostics, autonomous vehicle perception pipelines, and as the backbone for content moderation systems, its regular, compute-dense structure makes it the canonical benchmark for hardware accelerator performance.

The language models **GPT-2 / Llama**\index{GPT-2!autoregressive bottleneck}\index{Llama!memory-bound inference} power chatbots, code assistants, and content generation tools. These models generate text one token at a time, requiring the model to read its full parameter set (`{python} LighthouseModels.gpt2_params_b_str` billion for GPT-2, `{python} LighthouseModels.llama_range_str` billion for Llama) from memory for each output token. This sequential memory access pattern creates the autoregressive bottleneck that dominates serving costs.

The recommendation lighthouse, **DLRM**\index{DLRM!memory capacity bound}\index{recommendation systems!DLRM} (Deep Learning Recommendation Model), powers the "You might also like" recommendations on platforms like Meta and Netflix. It maps users and items to embedding vectors stored in tables that can exceed `{python} LighthouseModels.dlrm_embedding_str` GB, making memory capacity rather than computation the binding constraint.

The mobile lighthouse, **MobileNet**\index{MobileNet!depthwise separable convolutions}\index{MobileNet!efficiency gains}, runs in smartphone camera apps for real-time photo categorization and on-device visual search. It performs the same image classification task as ResNet but uses depthwise separable convolutions to reduce computation by `{python} LighthouseModels.mobilenet_flops_reduction_str`$\times$, enabling real-time inference on smartphones at `{python} LighthouseModels.mobile_tdp_range_str` watts.

The TinyML lighthouse, **Keyword Spotting (KWS)**\index{Keyword Spotting (KWS)!TinyML archetype}, represents the always-on sensing archetype. Used in applications like Smart Doorbells, it detects wake words ("Ding Dong", "Hello") using a depthwise separable CNN with approximately `{python} LighthouseModels.kws_params_str` parameters in its small variants (the DS-CNN benchmark in MLPerf Tiny uses ~200K), fitting in under `{python} LighthouseModels.kws_size_kb_str` KB and running continuously at under 1 milliwatt.

The huge range in compute requirements (20 MFLOPs → 4 GFLOPs) and memory (800 KB → 100 GB) explains why no single deployment paradigm fits all workloads. A keyword spotter runs comfortably on a \$2 microcontroller; a recommendation system requires a warehouse-scale computer. These five Lighthouse Models will serve as concrete anchors throughout the book, each isolating a distinct system bottleneck that we will revisit in every chapter.
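
One number captures why these workloads land in different archetypes: arithmetic intensity, the FLOPs performed per byte moved. A sketch using approximate figures consistent with the model sizes quoted above (the GPT-2 decode estimate of ~2 FLOPs per re-read FP32 weight per token is a standard back-of-envelope, not a measurement):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of data moved: high values point to a Compute
    Beast, low values to a Bandwidth Hog."""
    return flops / bytes_moved

# ResNet-50 inference: ~4.1 GFLOPs over ~100 MB of FP32 weights
# -> ~41 FLOPs/byte: enough reuse to keep arithmetic units busy.
resnet_ai = arithmetic_intensity(4.1e9, 100e6)

# GPT-2 decode: ~2 FLOPs per parameter per token, but every 4-byte FP32
# weight is re-read for each token -> 0.5 FLOPs/byte: firmly memory-bound.
gpt2_ai = arithmetic_intensity(2 * 1.5e9, 1.5e9 * 4)
```

Roughly two orders of magnitude separate the two, which is why the same accelerator that saturates on ResNet-50 sits mostly idle while generating tokens.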

With the analytical tools (Iron Law, Bottleneck Principle, Workload Archetypes) and reference workloads established, we can now apply them to concrete hardware. The next step translates these abstractions into quantitative engineering decisions by examining how system balance—the interplay of compute, memory, and I/O—varies across real hardware platforms.

## System Balance and Hardware {#sec-ml-systems-system-balance-hardware-96ab}

\index{latency!decision thresholds} \index{latency vs throughput!trade-offs}Physical constraints translate into engineering decisions through concrete numbers. @tbl-latency-numbers provides order-of-magnitude latencies that should inform every deployment decision—spanning eight orders of magnitude from nanosecond compute operations to hundreds of milliseconds for cross-region network calls. Detailed hardware latencies and bandwidth constraints are covered in @sec-hardware-acceleration. The key decision rule: if your latency budget is $X$ ms, you cannot use any operation with latency $> X$ in your critical path[^fn-critical-path-ml].

```{python}
#| label: latency-constants
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ LATENCY NUMBERS FOR ML SYSTEM DESIGN
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-latency-numbers in @sec-ml-systems-system-balance-hardware-96ab
# │
# │ Goal: Populate the 14-row latency reference table spanning compute, memory,
# │       network, and ML operation categories.
# │ Show: The 8-order-of-magnitude gap from nanosecond register access to
# │       hundreds-of-milliseconds cross-region network RTT.
# │ How: Assign representative string constants derived from hardware specs and
# │      published measurements; no arithmetic required.
# │
# │ Imports: (none — display string constants only)
# │ Exports: LatencyConstants.lat_compute_str, LatencyConstants.lat_npu_str,
# │          LatencyConstants.lat_llm_str, LatencyConstants.lat_l1_str,
# │          LatencyConstants.lat_hbm_str, LatencyConstants.lat_dram_str,
# │          LatencyConstants.lat_net_dc_str, LatencyConstants.lat_net_region_str,
# │          LatencyConstants.lat_net_cross_str, LatencyConstants.lat_kws_str,
# │          LatencyConstants.lat_face_str, LatencyConstants.lat_gpt4_str,
# │          LatencyConstants.lat_train_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── LEGO ───────────────────────────────────────────────
class LatencyConstants:
    """Namespace for Latency Constants."""

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    # Compute
    lat_compute_str = "~1 ns"  # GPU matrix multiply (per op)
    lat_npu_str = "5–20 ms"  # NPU inference (MobileNet)
    lat_llm_str = "20–100 ms"  # LLM token generation

    # Memory
    lat_l1_str = "~1 ns"  # L1 cache hit
    lat_hbm_str = "20–50 ns"  # HBM read (GPU)
    lat_dram_str = "50–100 ns"  # DRAM read (mobile)

    # Network
    lat_net_dc_str = "0.5 ms"  # same datacenter
    lat_net_region_str = "1–5 ms"  # same region
    lat_net_cross_str = "50–150 ms"  # cross-region

    # ML Operations
    lat_kws_str = "100 μs"  # wake-word detection (TinyML)
    lat_face_str = "10–30 ms"  # face detection (mobile)
    lat_gpt4_str = "200–500 ms"  # GPT-4 first token
    lat_train_str = "200–400 ms"  # ResNet-50 training step
```
|
||
|
||
These latencies, organized by category in @tbl-latency-numbers, span eight orders of magnitude:

| **Operation** | **Latency** | **Deployment Implication** |
|:---------------------------------|:-----------------------------------------------|:---------------------------------|
| **Compute** | | |
| **GPU matrix multiply (per op)** | `{python} LatencyConstants.lat_compute_str` | Compute is rarely the bottleneck |
| **NPU inference (MobileNet)** | `{python} LatencyConstants.lat_npu_str` | Mobile can do real-time vision |
| **LLM token generation** | `{python} LatencyConstants.lat_llm_str` | Perceived as "typing speed" |
| **Memory** | | |
| **L1 cache hit** | `{python} LatencyConstants.lat_l1_str` | Keep hot data in registers |
| **HBM read (GPU)** | `{python} LatencyConstants.lat_hbm_str` | 100$\times$ slower than compute |
| **DRAM read (mobile)** | `{python} LatencyConstants.lat_dram_str` | Memory-bound on most devices |
| **Network** | | |
| **Same datacenter** | `{python} LatencyConstants.lat_net_dc_str` | Microservices feasible |
| **Same region** | `{python} LatencyConstants.lat_net_region_str` | Edge servers viable |
| **Cross-region** | `{python} LatencyConstants.lat_net_cross_str` | Batch processing only |
| **ML Operations** | | |
| **Wake-word detection (TinyML)** | `{python} LatencyConstants.lat_kws_str` | Always-on feasible at <1 mW |
| **Face detection (mobile)** | `{python} LatencyConstants.lat_face_str` | Real-time at 30 FPS |
| **GPT-4 first token** | `{python} LatencyConstants.lat_gpt4_str` | User notices delay |
| **ResNet-50 training step** | `{python} LatencyConstants.lat_train_str` | Throughput-optimized |

: **Latency Numbers for ML System Design**\index{latency numbers!deployment constraints}\index{memory hierarchy!latency comparison}: Order-of-magnitude latencies across compute, memory, network, and ML operations that determine deployment feasibility. Spanning eight orders of magnitude, from nanosecond compute operations to hundreds of milliseconds for cross-region network calls, these physical constraints shape architectural decisions. For a comprehensive quick-reference including energy ratios and scaling rules, see @sec-machine-foundations-numbers-know-b531. {#tbl-latency-numbers}

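The round-trip entries in @tbl-latency-numbers make a quick feasibility filter: subtract the network round trip from an end-to-end latency budget and see which placements leave enough time for inference. The sketch below uses illustrative RTT and inference figures in the spirit of the table, not measurements, and `feasible_paradigms` is a hypothetical helper.

```python
# Which placements fit a latency budget? RTT figures are illustrative
# assumptions consistent with the table above, not measurements.
RTT_MS = {"on_device": 0.0, "same_region": 3.0, "cross_region": 100.0}

def feasible_paradigms(budget_ms, inference_ms):
    """Return placements whose network RTT leaves room for the model itself."""
    return [name for name, rtt in RTT_MS.items()
            if rtt + inference_ms <= budget_ms]

# A 30 ms interactive budget with a 20 ms model rules out cross-region cloud:
print(feasible_paradigms(budget_ms=30.0, inference_ms=20.0))
# A 500 ms batch-style budget tolerates any placement:
print(feasible_paradigms(budget_ms=500.0, inference_ms=20.0))
```

The first call excludes the cross-region option because 100 ms of RTT alone exceeds the 30 ms budget; no amount of model optimization can recover time lost to the speed of light.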
We can now ground the four deployment paradigms in concrete hardware. While @tbl-deployment-paradigms-overview defined the paradigms conceptually, @tbl-representative-systems (which appears later in this section, after the System Balance discussion) provides specific devices, processors, and quantitative thresholds that practitioners use to select deployment targets.[^fn-cost-spectrum-ml][^fn-pue-efficiency] The 9-order-of-magnitude range in power (MW cloud vs. mW TinyML) and the 6-order-of-magnitude range in cost (\$millions vs. \$10) determine which paradigm can serve a given workload economically.

These hardware differences translate directly into performance bottlenecks. To understand which constraint dominates in each paradigm, we apply the **Bottleneck Principle** (@sec-ml-systems-bottleneck-principle-3514) using the pipelined form of the Iron Law from @sec-introduction.

::: {.callout-perspective title="System Balance Across Paradigms"}

\index{system bottlenecks!dominant constraints}The pipelined form of the **Iron Law of ML Systems** from @sec-introduction-iron-law-ml-systems-c32a states that execution time is bounded by the slowest resource, as @eq-iron-law-extended formalizes:

$$T = \max\left( \frac{O}{R_{peak} \cdot \eta}, \frac{D_{vol}}{BW}, \frac{D_{vol}}{BW_{IO}} \right) + L_{lat}$$ {#eq-iron-law-extended}

Here, $O$ represents total operations, $R_{peak}$ is peak compute rate, $\eta$ is hardware utilization efficiency, $D_{vol}$ is data volume, $BW$ is memory bandwidth, $BW_{IO}$ is I/O bandwidth (storage or network), and $L_{lat}$ is fixed overhead. The equation identifies which resource—compute, memory, or I/O—limits performance. For a systematic diagnostic guide to identifying these bottlenecks, consult the D·A·M taxonomy\index{D·A·M taxonomy!bottleneck diagnosis} (@sec-dam-taxonomy).

The **dominant term varies by paradigm and workload**, changing the optimization strategy entirely:

| **Paradigm** | **Dominant Constraint** | **Why** | **Optimization Focus** |
|:------------------------|:--------------------------|:-----------------------------------------------------------|:---------------------------------------------|
| **Cloud Training** | $O/R_{peak}$ (Compute) | Abundant memory/network; FLOPS limit throughput | Maximize accelerator utilization, batch size |
| **Cloud LLM Inference** | $D_{vol}/BW$ (Memory BW) | Autoregressive: ~1 FLOP/byte, memory-bound | KV-caching, quantization, batching |
| **Edge Inference** | $D_{vol}/BW$ (Memory BW) | Limited HBM; models often memory-bound | Model compression, operator fusion |
| **Mobile** | Energy (implicit) | Battery = $\int \text{Power} \cdot dt$; thermal throttling | Reduced precision, duty cycling |
| **TinyML** | $D_{vol}/\text{Capacity}$ | 256 KB total; model must fit on-chip | Extreme compression, binary networks |

The same ResNet-50 model is **compute-bound**\index{compute-bound vs memory-bound!training vs inference}\index{roofline model!bottleneck analysis} during cloud training (high batch size, high arithmetic intensity) but **memory-bound** during single-image inference (batch=1, low arithmetic intensity) [@williams2009roofline]. Deployment paradigm selection must account for this shift.
:::

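The pipelined Iron Law is easy to evaluate directly. The sketch below is a minimal illustration, not the book's `mlsysim` library: the hardware figures (an A100-class 312 TFLOPS accelerator at assumed 60% utilization, 2 TB/s memory, 25 GB/s I/O) and the decode-step workload numbers are representative assumptions.

```python
# Evaluate T = max(O/(R_peak*eta), D_vol/BW, D_vol/BW_IO) + L_lat and report
# which term dominates. All figures below are illustrative assumptions.
def iron_law(ops, mem_bytes, io_bytes, r_peak, eta, bw_mem, bw_io, l_lat):
    """Return (execution time in seconds, name of the dominant term)."""
    terms = {
        "compute": ops / (r_peak * eta),  # O / (R_peak * eta)
        "memory": mem_bytes / bw_mem,     # D_vol / BW
        "io": io_bytes / bw_io,           # D_vol / BW_IO
    }
    dominant = max(terms, key=terms.get)
    return terms[dominant] + l_lat, dominant

# One decode step of a 7B-parameter LLM at FP16: every weight is streamed
# once (~14 GB) while only ~2 FLOPs per parameter execute -> ~1 FLOP/byte.
t, which = iron_law(ops=14e9, mem_bytes=14e9, io_bytes=4e3,
                    r_peak=312e12, eta=0.6, bw_mem=2.0e12, bw_io=25e9,
                    l_lat=0.0)
print(f"{t*1e3:.3f} ms per token, {which}-bound")
```

The memory term (14 GB at 2 TB/s, about 7 ms) dwarfs the compute term (well under 0.1 ms), matching the "Cloud LLM Inference" row of the table: autoregressive decoding is memory-bandwidth-bound regardless of peak FLOPS.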
This shift between training and inference is critical to understand. Recall the AI Triad from @sec-introduction: every ML system comprises Data, Algorithm, and Machine. The D·A·M taxonomy (@tbl-dam-phase) shows how each component behaves differently depending on whether the system is training (learning patterns) or serving (applying them).

| **Component** | **Training (Mutable)** | **Inference (Immutable)** |
|:--------------------------------------------------------|:------------------------------------------------------------|:--------------------------------------------------------|
| **Data**\index{training!data throughput} | Massive throughput: large batches, shuffling, augmentation | Low latency: single samples, freshness, speed |
| **Algorithm**\index{training!bidirectional computation} | Bidirectional: forward + backward pass, optimizer state | Unidirectional: forward pass only, weights frozen |
| **Machine**\index{inference!latency optimization} | Throughput-optimized: high-bandwidth clusters, large memory | Latency-optimized: edge devices, inference accelerators |

: **D·A·M$\times$ Phase**\index{D·A·M taxonomy!training vs inference}: The same model imposes starkly different demands on Data, Algorithm, and Machine depending on whether the system is training or serving. When bottlenecks shift unexpectedly, check which phase you're optimizing for. {#tbl-dam-phase}

The following worked example demonstrates how to apply this analysis quantitatively by comparing *ResNet-50 on cloud vs mobile* deployment targets.

```{python}
#| echo: false
#| label: resnet-setup

# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESNET-50 MODEL SIZE SETUP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 on Cloud vs Mobile"
# │
# │ Goal: Contrast ResNet-50 footprint across precision formats.
# │ Show: How quantization directly reduces the data volume term of the Iron Law.
# │ How: Calculate model size in MB for FP32, FP16, and INT8.
# │
# │ Imports: mlsysim.core.constants (RESNET50_FLOPs, RESNET50_PARAMS), mlsysim.book
# │ Exports: ResnetSetup.resnet_gflops_str, ResnetSetup.resnet_params_m_str,
# │          ResnetSetup.resnet_fp32_mb_str, ResnetSetup.resnet_fp16_mb_str,
# │          ResnetSetup.resnet_int8_mb_str
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.core.constants import RESNET50_FLOPs, RESNET50_PARAMS, GFLOPs, Mparam, byte, MB
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class ResnetSetup:
    """Namespace for Resnet Setup."""

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    resnet_fp32_bytes_value = RESNET50_PARAMS.m_as('param') * 4 * byte  # 4 bytes per FP32 param
    resnet_fp16_bytes_value = RESNET50_PARAMS.m_as('param') * 2 * byte  # 2 bytes per FP16 param
    resnet_int8_bytes_value = RESNET50_PARAMS.m_as('param') * 1 * byte  # 1 byte per INT8 param
    resnet_gflops_value = RESNET50_FLOPs.m_as(GFLOPs)
    resnet_params_m_value = RESNET50_PARAMS.m_as(Mparam)

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    resnet_gflops_str = fmt(resnet_gflops_value, precision=1, commas=False)  # e.g. "4.1" GFLOPs
    resnet_params_m_str = fmt(resnet_params_m_value, precision=1, commas=False)  # e.g. "25.6" M
    resnet_fp32_mb_str = fmt(resnet_fp32_bytes_value.m_as(MB), precision=0, commas=False)  # e.g. "102" MB
    resnet_fp16_mb_str = fmt(resnet_fp16_bytes_value.m_as(MB), precision=0, commas=False)  # e.g. "51" MB
    resnet_int8_mb_str = fmt(resnet_int8_bytes_value.m_as(MB), precision=0, commas=False)  # e.g. "26" MB
    # Quantity values needed by downstream cells (ResnetCloud, ResnetMobile class bodies)
```

```{python}
#| echo: false
#| label: resnet-cloud

# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESNET-50 CLOUD (A100) BOTTLENECK ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 on Cloud vs Mobile" — part (a) Cloud analysis
# │
# │ Goal: Identify the performance bottleneck for single-image cloud inference.
# │ Show: That even massive accelerators (A100) are memory-bound at batch=1.
# │ How: Apply the Iron Law to compare memory and compute terms for ResNet-50.
# │
# │ Imports: mlsysim.core.constants (A100_*, RESNET50_*), mlsysim.formulas (calc_bottleneck)
# │ Exports: ResnetCloud.a100_tflops_str, ResnetCloud.a100_bw_tbs_str,
# │          ResnetCloud.cloud_compute_str, ResnetCloud.cloud_memory_str,
# │          ResnetCloud.cloud_ratio_x_str, ResnetCloud.cloud_ai_frac,
# │          ResnetCloud.cloud_bottleneck_str, ResnetCloud.cloud_compute_frac,
# │          ResnetCloud.cloud_memory_frac
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim import Hardware
from mlsysim.core.constants import (
    RESNET50_FLOPs, A100_FLOPS_FP16_TENSOR, A100_MEM_BW,
    TFLOPs, second, TB, byte, flop,
)
from mlsysim.core.formulas import calc_bottleneck
from mlsysim.fmt import sci, fmt_percent, fmt, sci_latex, md_frac

# ┌── LEGO ───────────────────────────────────────────────
class ResnetCloud:
    """Namespace for Resnet Cloud."""

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    h_a100 = Hardware.A100
    cloud_stats = calc_bottleneck(
        ops=RESNET50_FLOPs,
        model_bytes=ResnetSetup.resnet_fp16_bytes_value,  # from resnet-setup cell
        device_flops=h_a100.peak_flops,
        device_bw=h_a100.memory_bw,
    )
    a100_tflops_value = h_a100.peak_flops.m_as(TFLOPs / second)
    a100_bw_tbs_value = h_a100.memory_bw.m_as(TB / second)
    cloud_compute_ms_value = cloud_stats["compute_ms"]
    cloud_memory_ms_value = cloud_stats["memory_ms"]
    cloud_ratio_x_value = cloud_stats["ratio"]
    cloud_ai_value = cloud_stats["intensity"]
    cloud_bottleneck_value = cloud_stats["bottleneck"]

    # --- LaTeX fraction components (for nice rendering) ---
    resnet_flops_latex = sci_latex(RESNET50_FLOPs.to(flop))
    a100_flops_latex = sci_latex(h_a100.peak_flops.to(flop / second))
    resnet_fp16_bytes_latex = sci_latex(ResnetSetup.resnet_fp16_bytes_value.to(byte))
    a100_bw_latex = sci_latex(h_a100.memory_bw.to(byte / second))
    cloud_compute_frac = md_frac(resnet_flops_latex, a100_flops_latex, f"{cloud_compute_ms_value:.3f}", "ms")
    cloud_memory_frac = md_frac(resnet_fp16_bytes_latex, a100_bw_latex, f"{cloud_memory_ms_value:.3f}", "ms")
    cloud_ai_frac = md_frac(resnet_flops_latex, resnet_fp16_bytes_latex, f"{cloud_ai_value:.0f}", "FLOPs/byte")

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    a100_tflops_str = fmt(a100_tflops_value, precision=0, commas=False)  # e.g. "312" TFLOPS
    a100_bw_tbs_str = fmt(a100_bw_tbs_value, precision=0, commas=False)  # e.g. "2" TB/s
    cloud_compute_ms_str = fmt(cloud_compute_ms_value, precision=3, commas=False)
    cloud_memory_ms_str = fmt(cloud_memory_ms_value, precision=3, commas=False)
    cloud_ratio_x_str = fmt(cloud_ratio_x_value, precision=0, commas=False)  # memory/compute ratio
    cloud_bottleneck_str = cloud_bottleneck_value  # "Memory" or "Compute"
    # Values needed by downstream ResnetMobile class body
```

```{python}
#| echo: false
#| label: resnet-mobile

# ┌─────────────────────────────────────────────────────────────────────────────
# │ RESNET-50 MOBILE (NPU) BOTTLENECK ANALYSIS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "ResNet-50 on Cloud vs Mobile" — part (b) Mobile analysis
# │
# │ Goal: Identify the performance bottleneck for mobile inference.
# │ Show: That the 40× bandwidth gap, not the 10,000× compute gap, determines performance.
# │ How: Compare memory and compute terms for ResNet-50 on a mobile NPU.
# │
# │ Imports: mlsysim.core.constants (MOBILE_NPU_*, A100_MEM_BW), mlsysim.formulas
# │ Exports: ResnetMobile.mobile_tops_str, ResnetMobile.mobile_bw_gbs_str,
# │          ResnetMobile.mobile_ratio_x_str, ResnetMobile.mobile_bottleneck_str,
# │          ResnetMobile.bw_advantage_x_str, ResnetMobile.inference_speed_x_str,
# │          ResnetMobile.mobile_compute_frac, ResnetMobile.mobile_memory_frac
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim import Hardware, Models
from mlsysim.core.constants import (
    RESNET50_FLOPs, MOBILE_NPU_TOPS_INT8, MOBILE_NPU_MEM_BW, A100_MEM_BW,
    TFLOPs, second, GB, byte, flop,
)
from mlsysim.core.formulas import calc_bottleneck
from mlsysim.fmt import sci_latex, md_frac, fmt_percent, fmt

# ┌── LEGO ───────────────────────────────────────────────
class ResnetMobile:
    """Namespace for Resnet Mobile."""

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    h_phone = Hardware.Edge.Generic_Phone
    m_resnet = Models.ResNet50
    h_a100 = Hardware.A100

    mobile_stats = calc_bottleneck(
        ops=m_resnet.inference_flops,
        model_bytes=ResnetSetup.resnet_int8_bytes_value,  # from resnet-setup cell
        device_flops=h_phone.peak_flops,
        device_bw=h_phone.memory_bw,
    )
    mobile_tops_value = h_phone.peak_flops.m_as(TFLOPs / second)
    mobile_bw_gbs_value = h_phone.memory_bw.m_as(GB / second)
    mobile_compute_ms_value = mobile_stats["compute_ms"]
    mobile_memory_ms_value = mobile_stats["memory_ms"]
    mobile_ratio_x_value = mobile_stats["ratio"]
    mobile_bottleneck_value = mobile_stats["bottleneck"]

    # --- Cross-platform comparison ---
    bw_advantage_x_value = h_a100.memory_bw / h_phone.memory_bw
    inference_speed_x_value = mobile_memory_ms_value / ResnetCloud.cloud_stats["memory_ms"]  # uses cloud_stats

    # --- LaTeX fraction components (for nice rendering) ---
    mobile_npu_flops_latex = sci_latex(h_phone.peak_flops.to(flop / second))
    resnet_int8_bytes_latex = sci_latex(ResnetSetup.resnet_int8_bytes_value.to(byte))
    mobile_npu_bw_latex = sci_latex(h_phone.memory_bw.to(byte / second))
    mobile_compute_frac = md_frac(ResnetCloud.resnet_flops_latex, mobile_npu_flops_latex, f"{mobile_compute_ms_value:.2f}", "ms")
    mobile_memory_frac = md_frac(resnet_int8_bytes_latex, mobile_npu_bw_latex, f"{mobile_memory_ms_value:.2f}", "ms")

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    mobile_tops_str = fmt(mobile_tops_value, precision=0, commas=False)  # e.g. "10" TOPS
    mobile_bw_gbs_str = fmt(mobile_bw_gbs_value, precision=0, commas=False)  # e.g. "50" GB/s
    mobile_ratio_x_str = fmt(mobile_ratio_x_value, precision=0, commas=False)  # memory/compute ratio
    mobile_bottleneck_str = mobile_bottleneck_value  # "Memory" or "Compute"
    bw_advantage_x_str = fmt(bw_advantage_x_value, precision=0, commas=False)  # A100 vs NPU bandwidth
    inference_speed_x_str = fmt(inference_speed_x_value, precision=0, commas=False)  # latency ratio
```

::: {.callout-notebook title="ResNet-50 on Cloud vs Mobile"}

\index{ResNet-50!cloud vs mobile}\index{arithmetic intensity!bottleneck analysis}
\index{arithmetic intensity!cloud vs mobile}
**Problem**: Determine whether ResNet-50 inference is compute-bound or memory-bound on (a) a high-end datacenter GPU (NVIDIA A100 class) and (b) a flagship mobile NPU (Apple/Qualcomm class).

**Given** (from Lighthouse Models):

- ResNet-50: `{python} ResnetSetup.resnet_gflops_str` GFLOPs per inference, `{python} ResnetSetup.resnet_params_m_str` M parameters (`{python} ResnetSetup.resnet_fp32_mb_str` MB at FP32, `{python} ResnetSetup.resnet_fp16_mb_str` MB at FP16)

**Analysis**:

**(a) Cloud: NVIDIA A100 (batch=1, FP16)**

- Peak compute: `{python} ResnetCloud.a100_tflops_str` TFLOPS (FP16)
- Memory bandwidth: `{python} ResnetCloud.a100_bw_tbs_str` TB/s (HBM2e)
- Compute time: $T_{\text{comp}}$ = `{python} ResnetCloud.cloud_compute_frac`
- Memory time: $T_{\text{mem}}$ = `{python} ResnetCloud.cloud_memory_frac`
- **Bottleneck**: `{python} ResnetCloud.cloud_bottleneck_str` (`{python} ResnetCloud.cloud_ratio_x_str`$\times$ slower than compute)
- **Arithmetic Intensity**: `{python} ResnetCloud.cloud_ai_frac` — this ratio of compute operations to bytes loaded measures how efficiently a workload uses the hardware. When arithmetic intensity exceeds the hardware's *compute-to-bandwidth ratio* ($R_{peak}/BW$), the workload is compute-bound; below it, the workload is memory-bound. For single-image inference, the low batch size yields low arithmetic intensity, explaining why even powerful GPUs are memory-bound at batch=1.

**(b) Mobile: Flagship NPU (batch=1, INT8)**

- Peak compute: ~`{python} ResnetMobile.mobile_tops_str` TOPS (INT8) — representative of modern mobile NPUs
- Memory bandwidth: ~`{python} ResnetMobile.mobile_bw_gbs_str` GB/s (LPDDR5)
- Model size: `{python} ResnetSetup.resnet_int8_mb_str` MB (INT8 quantized)
- Compute time: $T_{\text{comp}}$ = `{python} ResnetMobile.mobile_compute_frac`
- Memory time: $T_{\text{mem}}$ = `{python} ResnetMobile.mobile_memory_frac`
- **Bottleneck**: `{python} ResnetMobile.mobile_bottleneck_str` (`{python} ResnetMobile.mobile_ratio_x_str`$\times$ slower than compute)

**Key Insight**\index{quantization!deployment benefits}: Both platforms are memory-bound for single-image inference! The A100's faster memory bandwidth (`{python} ResnetCloud.a100_bw_tbs_str` TB/s vs `{python} ResnetMobile.mobile_bw_gbs_str` GB/s = `{python} ResnetMobile.bw_advantage_x_str`$\times$) translates to roughly `{python} ResnetMobile.inference_speed_x_str`$\times$ faster inference, not the 10,000$\times$ compute advantage. This explains why quantization (reducing bytes) often beats faster hardware (increasing FLOPS) for deployment.

**When does ResNet-50 become compute-bound?** Increase batch size until $\frac{\text{Ops}}{\text{Compute}} > \frac{\text{Bytes}}{\text{Memory BW}}$. On A100, this occurs around batch=64, where activations dominate memory traffic and high arithmetic intensity is sustained.
:::

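The bottleneck test in the callout can be reproduced with a few lines. This is a minimal sketch, not the book's `mlsysim.calc_bottleneck`; the hardware figures are representative of an A100-class GPU and a flagship phone NPU rather than vendor measurements.

```python
# Minimal sketch of the compute-bound vs memory-bound test from the callout.
# Hardware figures are representative assumptions, not measurements.
RESNET50_FLOPS = 4.1e9    # per 224x224 image
RESNET50_PARAMS = 25.6e6

def bottleneck(ops, model_bytes, peak_flops, mem_bw):
    """Return (compute seconds, memory seconds, limiting resource)."""
    t_compute = ops / peak_flops      # time spent on arithmetic
    t_memory = model_bytes / mem_bw   # time spent streaming weights
    return t_compute, t_memory, ("memory" if t_memory > t_compute else "compute")

# (a) Cloud: FP16 weights (2 bytes/param), ~312 TFLOPS, ~2 TB/s HBM
tc, tm, lim = bottleneck(RESNET50_FLOPS, RESNET50_PARAMS * 2, 312e12, 2.0e12)
print(f"cloud:  compute {tc*1e3:.3f} ms, memory {tm*1e3:.3f} ms -> {lim}-bound")

# (b) Mobile: INT8 weights (1 byte/param), ~10 TOPS, ~50 GB/s LPDDR5
tc, tm, lim = bottleneck(RESNET50_FLOPS, RESNET50_PARAMS * 1, 10e12, 50e9)
print(f"mobile: compute {tc*1e3:.2f} ms, memory {tm*1e3:.2f} ms -> {lim}-bound")

# Same conclusion via arithmetic intensity: 4.1e9 / 51.2e6 ~= 80 FLOPs/byte,
# below the assumed A100 balance point of 312e12 / 2e12 = 156 FLOPs/byte.
```

Both calls report memory as the limit, mirroring the callout: at batch=1, streaming the weights takes longer than the arithmetic on either platform.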
As systems transition from Cloud to Edge to TinyML, available resources decrease dramatically. @tbl-representative-systems quantifies this progression with concrete hardware examples: memory drops from 131 TB (cloud) to 520 KB (TinyML), a 250 million-fold reduction, while power budgets span nine orders of magnitude from megawatts to milliwatts[^fn-cost-spectrum-ml]. This resource disparity is most acute on microcontrollers, the primary hardware platform for TinyML, where memory and storage capacities are insufficient for conventional ML models.

[^fn-cost-spectrum-ml]: **ML Hardware Cost Spectrum**: AI infrastructure spans six orders of magnitude in cost, from \$10 microcontrollers to multi-million-dollar GPU clusters. This six-order-of-magnitude range means deployment paradigm selection is simultaneously a physics decision and an economics decision: the same accuracy target may be achievable on a \$2 microcontroller (via aggressive quantization) or a \$30,000 GPU (at full precision), with fundamentally different latency, power, and operational cost profiles. \index{Hardware Cost!deployment spectrum}

[^fn-pue-efficiency]: **Power Usage Effectiveness (PUE)**: This metric isolates the energy overhead (e.g., cooling) that determines the economic viability of the "MW cloud" paradigm. For a datacenter, the remaining 6% overhead of an elite 1.06 PUE still translates to hundreds of kilowatts (megawatts at campus scale) of non-compute cost. This entire cost category does not exist for the "mW TinyML" paradigm, explaining a key part of the 6-order-of-magnitude economic range. \index{PUE!efficiency overhead}

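The PUE arithmetic is a one-liner: facility draw equals IT draw times PUE, so the non-compute overhead is the IT load times (PUE − 1). The IT-load figures below are assumed round numbers for illustration.

```python
# Non-compute (cooling, power delivery) load implied by a PUE figure.
# IT loads below are assumed round numbers, not measured facilities.
def pue_overhead_kw(it_load_kw, pue):
    """Overhead power in kW: facility = IT * PUE, overhead = IT * (PUE - 1)."""
    return it_load_kw * (pue - 1.0)

print(round(pue_overhead_kw(it_load_kw=4_000, pue=1.06)))    # single ~4 MW pod
print(round(pue_overhead_kw(it_load_kw=100_000, pue=1.06)))  # ~100 MW campus
```

Even an elite 1.06 PUE costs about 240 kW of overhead on a 4 MW pod and 6 MW on a 100 MW campus; a battery-powered TinyML node has no analogous line item at all.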
```{python}
#| label: mobile-hardware-specs
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MOBILE HARDWARE SPECS: RAM, STORAGE, BANDWIDTH, NPU RANGES
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-representative-systems (Mobile ML row, ~30 lines below),
# │          @tbl-deployment-thresholds (Mobile ML row),
# │          @sec-ml-systems-mobile-ml-benefits-resource-constraints-c568.
# │
# │ Goal: Provide mobile-specific hardware range strings for tables and prose.
# │ Show: RAM, storage, NPU TOPS, and memory bandwidth ranges for smartphones.
# │ How: Read ranges from centralized constants; derive bandwidth range from
# │      phone hardware twin.
# │
# │ Imports: mlsysim.core.constants (MOBILE_RAM_RANGE_GB, MOBILE_STORAGE_RANGE),
# │          mlsysim.Hardware (Generic_Phone for bandwidth)
# │ Exports: MobileHardwareSpecs.mobile_ram_range_str,
# │          MobileHardwareSpecs.mobile_storage_range_str,
# │          MobileHardwareSpecs.mobile_npu_range_str,
# │          MobileHardwareSpecs.mobile_bw_range_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware
from mlsysim.core.constants import MOBILE_RAM_RANGE_GB, MOBILE_STORAGE_RANGE

# ┌── LEGO ───────────────────────────────────────────────
class MobileHardwareSpecs:
    """Namespace for mobile hardware specification ranges."""

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    h_phone = Hardware.Edge.Generic_Phone
    mobile_ram_range = MOBILE_RAM_RANGE_GB
    mobile_storage_range = MOBILE_STORAGE_RANGE

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    mobile_bw_range = f"{int(h_phone.memory_bw.m_as('GB/s')/2)}-{int(h_phone.memory_bw.m_as('GB/s'))}"

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    mobile_ram_range_str = mobile_ram_range
    mobile_storage_range_str = mobile_storage_range
    mobile_bw_range_str = mobile_bw_range
    mobile_npu_range_str = "1-10"
```

```{python}
#| label: hardware-spectrum-setup
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ HARDWARE SPECTRUM: REPRESENTATIVE SYSTEMS TABLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @tbl-representative-systems in @sec-ml-systems-system-balance-hardware-96ab
# │          and the deployment decision thresholds table that follows it.
# │
# │ Goal: Ground abstract deployment paradigms in concrete hardware specs for
# │       TPU v4 Pod (Cloud), DGX Spark (Edge), and ESP32-CAM (TinyML).
# │ Show: The power gap from ~4 MW (cloud) to 0.1 W (TinyML) and the cost gap
# │       from $millions to $10 across tiers.
# │ How: Read memory, power, and cost from mlsysim.core.constants for each platform;
# │      assign threshold strings for the decision boundary table.
# │
# │ Imports: mlsysim.core.constants (TPU_POD_MEM, TPU_POD_POWER, TPU_POD_CHIPS,
# │          DGX_RAM, DGX_STORAGE, DGX_POWER, DGX_PRICE_MIN, DGX_PRICE_MAX,
# │          ESP32_RAM, ESP32_FLASH, ESP32_POWER_MIN, ESP32_POWER_MAX, ESP32_PRICE)
# │ Exports: HardwareSpectrumSetup.tpu_chips_str,
# │          HardwareSpectrumSetup.cloud_mem_tb_str,
# │          HardwareSpectrumSetup.cloud_pwr_mw_str,
# │          HardwareSpectrumSetup.edge_mem_gb_str,
# │          HardwareSpectrumSetup.edge_stor_tb_str,
# │          HardwareSpectrumSetup.edge_pwr_w_str,
# │          HardwareSpectrumSetup.edge_price_min_str,
# │          HardwareSpectrumSetup.edge_price_max_str,
# │          HardwareSpectrumSetup.tiny_ram_kb_str,
# │          HardwareSpectrumSetup.tiny_flash_mb_str,
# │          HardwareSpectrumSetup.tiny_pwr_min_str,
# │          HardwareSpectrumSetup.tiny_pwr_max_str,
# │          HardwareSpectrumSetup.tiny_price_str,
# │          HardwareSpectrumSetup.cloud_thresh_tflops_str,
# │          HardwareSpectrumSetup.edge_thresh_pflops_str,
# │          HardwareSpectrumSetup.tiny_thresh_tops_str,
# │          HardwareSpectrumSetup.tiny_thresh_mw_str
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.core.constants import (
    TPU_POD_MEM, TPU_POD_POWER, TPU_POD_CHIPS,
    DGX_RAM, DGX_STORAGE, DGX_POWER, DGX_PRICE_MIN, DGX_PRICE_MAX,
    ESP32_RAM, ESP32_FLASH, ESP32_POWER_MIN, ESP32_POWER_MAX, ESP32_PRICE,
    TB, GB, KiB, MB, watt, USD
)
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class HardwareSpectrumSetup:
    """Namespace for Hardware Spectrum Setup."""

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    tpu_chips_str = f"{TPU_POD_CHIPS:,}"  # e.g. "4,096" chips
    cloud_mem_tb_str = fmt(TPU_POD_MEM.m_as(TB), precision=0, commas=False)  # e.g. "131" TB
    cloud_pwr_mw_str = fmt(TPU_POD_POWER.m_as("megawatt"), precision=0, commas=False)  # e.g. "4" MW

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    edge_mem_gb_str = fmt(DGX_RAM.m_as(GB), precision=0, commas=False)  # e.g. "128" GB
    edge_stor_tb_str = fmt(DGX_STORAGE.m_as(TB), precision=0, commas=False)  # e.g. "4" TB
    edge_pwr_w_str = fmt(DGX_POWER.m_as(watt), precision=0, commas=False)  # e.g. "500" W
    edge_price_min_str = f"{DGX_PRICE_MIN.m_as(USD):,.0f}"  # e.g. "3,000"
    edge_price_max_str = f"{DGX_PRICE_MAX.m_as(USD):,.0f}"  # e.g. "5,000"

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    tiny_ram_kb_str = fmt(ESP32_RAM.m_as(KiB), precision=0, commas=False)  # e.g. "520" KB
    tiny_flash_mb_str = fmt(ESP32_FLASH.m_as(MB), precision=0, commas=False)  # e.g. "4" MB
    tiny_pwr_min_str = f"{ESP32_POWER_MIN.m_as(watt)}"  # e.g. "0.1" W
    tiny_pwr_max_str = f"{ESP32_POWER_MAX.m_as(watt)}"  # e.g. "0.5" W
    tiny_price_str = f"{ESP32_PRICE.m_as(USD)}"  # e.g. "10" USD

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    cloud_thresh_tflops_str = "1000"  # TFLOPS threshold for cloud
    cloud_thresh_bw_str = "100"       # GB/s memory bandwidth
    edge_thresh_pflops_str = "1"      # PFLOPS AI compute threshold
    edge_thresh_bw_str = "270"        # GB/s memory bandwidth
    tiny_thresh_tops_str = "1"        # TOPS compute threshold
    tiny_thresh_mw_str = "1"          # mW power threshold
```

\index{hardware spectrum!resource progression}
|
||
@tbl-representative-systems grounds these paradigms in concrete hardware platforms and price points:
|
||
|
||
\begingroup\small
|
||
|
||
| **Category** | **Example Device** | **Processor** | **Memory** | **Storage** | **Power** | **Price Range** |
|
||
|:--------------|:--------------------|----------------------------------------------------------------------:|------------------------------------------------------------:|:------------------------------------------------------------|------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------|
|
||
| **Cloud ML** | Google TPU v4 Pod | `{python} HardwareSpectrumSetup.tpu_chips_str` TPU v4 chips, >1 EFLOP | `{python} HardwareSpectrumSetup.cloud_mem_tb_str` TB HBM2 | Cloud-scale (PB) | ~`{python} HardwareSpectrumSetup.cloud_pwr_mw_str` MW | Cloud service (rental) |
|
||
| **Edge ML** | NVIDIA DGX Spark | GB10 Grace Blackwell, 1 PFLOPS AI | `{python} HardwareSpectrumSetup.edge_mem_gb_str` GB LPDDR5x | `{python} HardwareSpectrumSetup.edge_stor_tb_str` TB NVMe | ~`{python} HardwareSpectrumSetup.edge_pwr_w_str` W | ~\$`{python} HardwareSpectrumSetup.edge_price_min_str`–`{python} HardwareSpectrumSetup.edge_price_max_str` |
|
||
| **Mobile ML** | Flagship Smartphone | Mobile SoC (CPU + GPU + NPU) | `{python} MobileHardwareSpecs.mobile_ram_range_str` GB RAM | `{python} MobileHardwareSpecs.mobile_storage_range_str` | `{python} LighthouseModels.mobile_tdp_range_str` W | USD 999+ |
|
||
| **TinyML** | ESP32-CAM | Dual-core @ 240 MHz | `{python} HardwareSpectrumSetup.tiny_ram_kb_str` KB RAM | `{python} HardwareSpectrumSetup.tiny_flash_mb_str` MB Flash | `{python} HardwareSpectrumSetup.tiny_pwr_min_str`–`{python} HardwareSpectrumSetup.tiny_pwr_max_str` W | \$`{python} HardwareSpectrumSetup.tiny_price_str` |
|
||
|
||
: **Hardware Spectrum (Concrete Platforms)**\index{hardware spectrum!deployment platforms}\index{domain-specific accelerators!datacenter scale}\index{workstation-class accelerators!edge deployment}: Representative devices that instantiate each deployment paradigm from @tbl-deployment-paradigms-overview. Where the conceptual table defines operating regimes, this table provides the specific processors, memory capacities, power envelopes, and price points that practitioners use to match workloads to hardware. The DGX Spark sits at the high end of the edge spectrum; most edge deployments use far smaller devices (e.g., Jetson Orin Nano). We include it to illustrate the *ceiling* of non-cloud deployment. {#tbl-representative-systems}
|
||
|
||
\endgroup

| **Paradigm** | **Compute** | **Memory BW** | **Power** | **Latency** |
|:--------------|:-------------------------------------------------------------------|:-----------------------------------------------------------|:--------------------------------------------------------|:-------------------------------------------------------|
| **Cloud ML** | >`{python} HardwareSpectrumSetup.cloud_thresh_tflops_str` TFLOPS | >`{python} HardwareSpectrumSetup.cloud_thresh_bw_str` GB/s | PUE 1.1–1.3 | 100–500 ms |
| **Edge ML** | ~`{python} HardwareSpectrumSetup.edge_thresh_pflops_str` PFLOPS AI | >`{python} HardwareSpectrumSetup.edge_thresh_bw_str` GB/s | 100s W | `{python} MLSystemsSetup.edge_latency_range_str` ms |
| **Mobile ML** | `{python} MobileHardwareSpecs.mobile_npu_range_str` TOPS | `{python} MobileHardwareSpecs.mobile_bw_range_str` GB/s | <2 W | <`{python} MLSystemsSetup.mobile_latency_range_str` ms |
| **TinyML** | <`{python} HardwareSpectrumSetup.tiny_thresh_tops_str` TOPS | — | <`{python} HardwareSpectrumSetup.tiny_thresh_mw_str` mW | µs |

: **Deployment Decision Thresholds**: Quantitative thresholds that practitioners use to determine deployment feasibility for each paradigm in @tbl-representative-systems. These numbers answer the practical question "can my workload run here?" by specifying the compute, memory bandwidth, and power envelope that each paradigm provides. {#tbl-deployment-thresholds}

These deployment paradigms emerged from decades of hardware evolution, from floating-point coprocessors in the 1980s through graphics processors in the 2000s to today's domain-specific AI accelerators. @sec-hardware-acceleration traces this historical progression and the architectural principles that drove it. Here, we focus on the *consequences* of this evolution: the deployment spectrum that results from having qualitatively different hardware available at different points in the infrastructure.

Each paradigm occupies a distinct region of the deployment spectrum, governed by the physical constraints (Light Barrier, Power Wall, Memory Wall) and quantified by the analytical tools (Iron Law, Bottleneck Principle) introduced above. The quantitative thresholds in @tbl-deployment-thresholds help practitioners determine which paradigm suits their workload. The following four sections progress from cloud to TinyML, tracing the gradient from maximum computational resources to maximum efficiency constraints.

Each section follows a consistent structure: definition, key characteristics, benefits and trade-offs, and representative applications. This parallel treatment reveals both what distinguishes each paradigm and what principles they share, setting the stage for the hybrid architectures that combine them. We begin at the resource-rich end of the spectrum and progressively tighten the constraints.
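
Threshold tables like @tbl-deployment-thresholds lend themselves to a rough feasibility filter. The sketch below is purely illustrative: the power draws and best-case latency floors are placeholder values in the spirit of the table, not the chapter's computed figures.

```python
# Illustrative feasibility filter over the four deployment paradigms.
# Power draws (W) and best-case latency floors (ms) are placeholder
# values, not the chapter's computed thresholds.
PARADIGMS = {
    "Cloud ML":  {"power_w": float("inf"), "latency_floor_ms": 100.0},
    "Edge ML":   {"power_w": 500.0,        "latency_floor_ms": 10.0},
    "Mobile ML": {"power_w": 2.0,          "latency_floor_ms": 5.0},
    "TinyML":    {"power_w": 0.1,          "latency_floor_ms": 0.001},
}

def feasible(power_budget_w: float, latency_budget_ms: float) -> list:
    """Paradigms whose power draw and latency floor fit both budgets."""
    return [
        name for name, p in PARADIGMS.items()
        if p["power_w"] <= power_budget_w
        and p["latency_floor_ms"] <= latency_budget_ms
    ]

# A 10 ms control loop on a 2 W battery budget: the cloud's ~100 ms
# network floor and the edge server's power draw both rule them out.
print(feasible(power_budget_w=2.0, latency_budget_ms=10.0))
```

The point of the exercise is that both budgets must hold simultaneously; relaxing either one re-admits paradigms the other had excluded.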

## Cloud ML: Computational Power {#sec-ml-systems-cloud-ml-maximizing-computational-power-a338}


```{python}
#| label: gpt3-training-scale
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ GPT-3 TRAINING SCALE: CLOUD ML OPENING EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Opening paragraph of the Cloud ML section
# │          (immediately below).
# │
# │ Goal: Compute GPT-3 petaflop-days and provide training cost/scale stats.
# │ Show: 3,640 PF-days, 10,000 V100s, 15 days, ~$4.6M cost.
# │ How: Derive petaflop-days from Models.GPT3.training_ops; format
# │      GPU count, duration, and cost from known values.
# │
# │ Imports: mlsysim.Models (GPT3), mlsysim.constants (PFLOPs, SEC_PER_DAY, ureg),
# │          mlsysim.book (fmt, check)
# │ Exports: GPT3TrainingScale.gpt3_petaflop_days_str,
# │          GPT3TrainingScale.gpt3_v100_count_str,
# │          GPT3TrainingScale.gpt3_days_str,
# │          GPT3TrainingScale.gpt3_cost_m_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Models
from mlsysim.core.constants import PFLOPs, SEC_PER_DAY, ureg
from mlsysim.fmt import fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class GPT3TrainingScale:
    """Namespace for GPT-3 training scale statistics."""

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    m_gpt3 = Models.GPT3
    gpt3_days = 15
    gpt3_cost_m = 4.6
    gpt3_v100_count = 10000

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    gpt3_petaflop_days = (m_gpt3.training_ops / (PFLOPs * SEC_PER_DAY)).to_base_units().m_as(ureg.dimensionless)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(gpt3_petaflop_days >= 3000, f"GPT-3 training should be >=3000 PF-days, got {gpt3_petaflop_days:.0f}")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    gpt3_petaflop_days_str = fmt(gpt3_petaflop_days, precision=0, commas=True)
    gpt3_v100_count_str = fmt(gpt3_v100_count, precision=0, commas=True)
    gpt3_days_str = fmt(gpt3_days, precision=0)
    gpt3_cost_m_str = fmt(gpt3_cost_m, precision=1)
```

\index{Cloud ML!datacenter scale} \index{data centers!ML infrastructure}
\index{Cloud ML!workload archetypes}
Consider what it took to train GPT-3: `{python} GPT3TrainingScale.gpt3_petaflop_days_str` petaflop-days of computation, `{python} GPT3TrainingScale.gpt3_v100_count_str` GPUs running for approximately `{python} GPT3TrainingScale.gpt3_days_str` days, consuming megawatts of power—at an estimated cost of ~$`{python} GPT3TrainingScale.gpt3_cost_m_str`M[^fn-nlp-training-scale]. No smartphone, no edge server, no single machine on Earth could have performed this computation. Only a datacenter, with its virtually unlimited compute, memory, and storage, could aggregate enough resources to make this possible. This is the defining proposition of Cloud ML: if you can tolerate latency, you can access computational scale that no other paradigm can match.
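
The petaflop-day figure is simple arithmetic over total training operations. A back-of-envelope check, using the commonly cited ~3.14e23 total FLOPs for GPT-3 (an assumed round number here, not the value computed from `mlsysim` above):

```python
# Back-of-envelope check on the petaflop-day figure.
# 3.14e23 total training FLOPs is the commonly cited estimate (assumed).
TOTAL_FLOPS = 3.14e23
PFLOP_PER_SEC = 1e15    # one petaflop per second
SEC_PER_DAY = 86_400

petaflop_days = TOTAL_FLOPS / (PFLOP_PER_SEC * SEC_PER_DAY)
print(f"{petaflop_days:,.0f} petaflop-days")  # ~3,634

# Roughly how long on 10,000 V100s, assuming ~28 TFLOPS sustained each
# (an illustrative utilization figure, well below the 125 TFLOPS peak):
days = TOTAL_FLOPS / (10_000 * 28e12 * SEC_PER_DAY)
print(f"~{days:.0f} days")  # ~13
```

The gap between this idealized ~13 days and the reported ~15 reflects real-world utilization, checkpointing, and scheduling overheads.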

Cloud ML aggregates computational resources in data centers[^fn-cloud-utility-model] to handle computationally intensive tasks: large-scale data processing, collaborative model development, and advanced analytics. This infrastructure serves as the natural home for three of the four Workload Archetypes: Compute Beasts like ResNet training that demand sustained TFLOPS across thousands of accelerators, Bandwidth Hogs like large language model inference that benefit from TB/s HBM bandwidth, and Sparse Scatter workloads like recommendation systems that require terabytes of embedding tables and high-bandwidth interconnects for all-to-all communication patterns.

Cloud deployments range from single-machine instances (workstations, multi-GPU servers, DGX systems) to large-scale distributed systems spanning multiple data centers. This book focuses on single-machine cloud systems, where the reader learns to build and optimize ML systems on individual powerful machines. Future studies can address distributed cloud infrastructure, where systems coordinate computation across multiple networked machines. This follows the principle of establishing foundations before adding complexity.

[^fn-cloud-utility-model]: **Cloud as Utility Computing**: The utility model allows providers to offer a specialized hardware portfolio that is economically infeasible for a single organization to maintain. This provides direct, on-demand access to the specific architectures required by each workload archetype: dense accelerator pods for Compute Beasts, HBM-equipped nodes for Bandwidth Hogs, and high-memory systems with fast interconnects for Sparse Scatter. A team can therefore rent a purpose-built, $10M+ supercomputing pod for a few hours rather than owning it. \index{Cloud Infrastructure!utility model}

[^fn-nlp-training-scale]: **LLM Training Scale**: GPT-3 required approximately 3,640 petaflop-days, 10,000 V100 GPUs, and an estimated \$4.6M in compute at 2020 cloud rates. This scale illustrates the core Cloud ML trade-off: only centralized infrastructure can aggregate enough $R_{peak}$ for peta-scale training, but the resulting $L_{lat}$ penalty (100--500 ms network round trip) makes that same infrastructure unsuitable for real-time inference. \index{LLM!training scale}

What unifies these diverse cloud workloads is a single defining trade-off:

::: {.callout-definition title="Cloud ML"}

***Cloud Machine Learning***\index{Cloud ML!definition} is the deployment paradigm that optimizes for **Resource Elasticity** by decoupling computational capacity from physical location.

1. **Significance (Quantitative):** It enables systems to scale resources ($R_{peak}$) proportional to workload variance, allowing for bursts of peta-flops that would be economically unfeasible to maintain locally.
2. **Distinction (Durable):** Unlike **Edge ML**, which prioritizes **Data Locality**, Cloud ML prioritizes **Computational Density** and centralized management.
3. **Common Pitfall:** A frequent misconception is that Cloud ML is "unlimited compute." In reality, it is constrained by the **Distance Penalty** ($L_{lat}$) and the **Ingestion Bottleneck** ($BW$), making it unsuitable for sub-10ms real-time control loops.

:::

@fig-cloud-ml breaks down Cloud ML across several dimensions that define its computational paradigm. The **Characteristics** branch emphasizes centralization and dynamic scalability, which directly enables the **Benefits** of scalable data processing and global accessibility. This centralization, however, creates the **Challenges** of latency and internet dependence, shaping the kinds of **Examples** that thrive in the cloud: virtual assistants, recommendation systems, and fraud detection. The most fundamental of these challenges, network latency, is not an engineering limitation but a physics constraint. A quick calculation of the distance penalty after the figure makes this concrete.

::: {#fig-cloud-ml fig-env="figure" fig-pos="t" fig-cap="**Cloud ML Decomposition.** Characteristics, benefits, challenges, and representative applications of cloud machine learning, where centralized infrastructure and specialized hardware address scale, complexity, and resource management for large datasets and complex computations." fig-alt="Tree diagram with Cloud ML branching to four categories: Characteristics, Benefits, Challenges, and Examples. Each lists items like computational power, scalability, vendor lock-in, and virtual assistants."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
Box/.style={inner xsep=2pt,
draw=GreenLine,
fill=GreenL!50,
node distance=0.4,
line width=0.75pt,
anchor=west,
text width=30mm,align=flush center,
minimum width=30mm, minimum height=9.5mm
},
Box2/.style={Box,draw=BlueLine,fill=BlueL!50, text width=27mm, minimum width=27mm
},
Box3/.style={Box,draw=OrangeLine,fill=OrangeL!40, text width=38mm, minimum width=38mm
},
Box4/.style={Box,draw=VioletLine,fill=VioletL2!40, text width=32mm, minimum width=32mm
},
Line/.style={line width=1.0pt,black!50,text=black,-{Triangle[width=0.8*6pt,length=0.98*6pt]}},
}
\node[Box4, fill=VioletL2!90!violet!50,](B1){Characteristics};
\node[Box2,right=2 of B1,fill=BlueL](B2){Benefits};
\node[Box,right=2 of B2,fill=GreenL](B3){Challenges};
\node[Box3,right=2 of B3,fill=OrangeL](B4){Examples};
\node[Box,draw=OliveLine,fill=OliveL!30, minimum height=11.5mm,
above=1 of $(B2.north east)!0.5!(B3.north west)$](B0){Cloud ML};
%
\node[Box4,below=0.7 of B1](B11){Immense Computational Power};
\node[Box4,below=of B11](B12){Collaborative Environment};
\node[Box4,below=of B12](B13){Access to Advanced Tools};
\node[Box4,below=of B13](B14){Dynamic Scalability};
\node[Box4,below=of B14](B15){Centralized Infrastructure};
%
\node[Box2,below=0.7 of B2](B21){Scalable Data Processing and Model Training};
\node[Box2,below=of B21](B22){Collaboration and Resource Sharing};
\node[Box2,below=of B22](B23){Flexible Deployment and Accessibility};
\node[Box2,below=of B23](B24){Cost-Effectiveness and Scalability};
\node[Box2,below=of B24](B25){Global Accessibility};
%
\node[Box,below=0.7 of B3](B31){Vendor Lock-In};
\node[Box,below=of B31](B32){Latency Issues};
\node[Box,below=of B32](B33){Data Privacy and Security};
\node[Box,below=of B33](B34){Dependency on Internet};
\node[Box,below=of B34](B35){Cost Considerations};
%
\node[Box3,below=0.7 of B4](B41){Virtual Assistants};
\node[Box3,below=of B41](B42){Security and Anomaly Detection};
\node[Box3,below=of B42](B43){Recommendation Systems};
\node[Box3,below=of B43](B44){Fraud Detection};
\node[Box3,below=of B44](B45){Personalized User Experience};
%
\foreach \i in{1,2,3,4,5}{
\foreach \x in{1,2,3,4}{
\draw[Line](B\x.west)--++(180:0.5)|-(B\x\i);
}
}
\foreach \x in{1,2,3,4}{
\draw[Line](B0)-|(B\x);
}
\end{tikzpicture}

```
:::

```{python}
#| echo: false
#| label: distance-penalty

# ┌─────────────────────────────────────────────────────────────────────────────
# │ DISTANCE PENALTY CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Distance Penalty" callout in
# │          @sec-ml-systems-cloud-ml-tradeoffs-constraints-96ed
# │
# │ Goal: Demonstrate why cloud inference is physically impossible for a
# │       10 ms safety-critical response budget at 1,500 km distance.
# │ Show: That speed-of-light RTT alone (15 ms) already exceeds the 10 ms
# │       budget, leaving a −5 ms deficit before any computation begins.
# │ How: Apply calc_network_latency_ms() using SPEED_OF_LIGHT_FIBER_KM_S;
# │      subtract RTT from budget to get deficit.
# │
# │ Imports: mlsysim.core.constants (SPEED_OF_LIGHT_FIBER_KM_S),
# │          mlsysim.formulas (calc_network_latency_ms), mlsysim.book (fmt, check)
# │ Exports: DistancePenalty.sol_kms_str, DistancePenalty.rtt_formatted_str,
# │          DistancePenalty.deficit_str, DistancePenalty.distance_km_str,
# │          DistancePenalty.safety_budget_str
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.core.constants import SPEED_OF_LIGHT_FIBER_KM_S
from mlsysim.core.formulas import calc_network_latency_ms
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class DistancePenalty:
    """Namespace for Distance Penalty."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    distance_km_value = 1500      # km to cloud datacenter
    safety_budget_ms_value = 10   # ms safety requirement

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    round_trip_ms_value = calc_network_latency_ms(distance_km_value)
    deficit_ms_value = safety_budget_ms_value - round_trip_ms_value

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    sol_kms_str = f"{SPEED_OF_LIGHT_FIBER_KM_S.m_as('km/s'):,.0f}"           # e.g. "200,000" km/s
    rtt_formatted_str = fmt(round_trip_ms_value, precision=0, commas=False)  # e.g. "15" ms
    deficit_str = fmt(deficit_ms_value, precision=0, commas=False)           # e.g. "-5" ms
    distance_km_str = f"{distance_km_value:,}"                               # e.g. "1,500" km
    safety_budget_str = f"{safety_budget_ms_value}"                          # "10" ms
```

::: {.callout-notebook title="The Distance Penalty"}

\index{distance penalty!cloud latency} \index{Light Barrier!safety-critical systems}**Problem**: You are deploying a real-time safety monitor for a robotic arm. The safety logic requires a **`{python} DistancePenalty.safety_budget_str` ms** end-to-end response time to prevent injury. Your model runs in a high-performance cloud data center `{python} DistancePenalty.distance_km_str` km away.

**The Physics**:

1. **Light in Fiber**: ~`{python} DistancePenalty.sol_kms_str` km/s.
2. **Round-trip Propagation**: (`{python} DistancePenalty.distance_km_str` km$\times$ 2) / `{python} DistancePenalty.sol_kms_str` km/s = **`{python} DistancePenalty.rtt_formatted_str` ms**.
3. **The Result**: Your safety budget is already **negative** (`{python} DistancePenalty.deficit_str` ms) before the model even starts its first calculation.

**The Engineering Conclusion**: Physics has made Cloud ML **impossible** for this application. You must move to the Edge.
:::
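
The same physics can be inverted: instead of asking how much a given distance costs, ask how far away a server may be before propagation alone exhausts the budget. A minimal sketch, using the ~200,000 km/s fiber figure; `rtt_ms` here is an illustrative stand-in for the book's `calc_network_latency_ms`, not the library function itself:

```python
# Propagation-only latency: a hard lower bound that ignores routing,
# queuing, and compute time, so real budgets are tighter still.
C_FIBER_KM_S = 200_000  # approximate speed of light in optical fiber

def rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay in milliseconds."""
    return 2 * distance_km * 1000 / C_FIBER_KM_S

def max_distance_km(budget_ms: float) -> float:
    """Farthest server whose round trip still fits the latency budget."""
    return budget_ms * C_FIBER_KM_S / 2000

print(rtt_ms(1500))         # 15.0 ms -- already over a 10 ms budget
print(max_distance_km(10))  # 1000.0 km ceiling, before any compute time
```

The 1,000 km ceiling is generous: once routing hops, queuing, and inference time are charged against the same budget, the practical radius shrinks well below it.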

### Cloud Infrastructure and Scale {#sec-ml-systems-cloud-infrastructure-scale-f0b1}

\index{Cloud ML!accelerator infrastructure} Cloud ML aggregates computational resources in data centers at unprecedented scale. @fig-cloudml-example captures the physical scale behind this abstraction: Google's Cloud TPU[^fn-tpu-specialization] data center, where row upon row of specialized accelerators deliver petaflop-scale training throughput. @tbl-representative-systems quantifies how cloud systems provide orders-of-magnitude more compute and memory bandwidth than mobile devices, at correspondingly higher power and operational cost. Modern cloud accelerator systems operate at petaflops to exaflops of peak reduced-precision throughput and require megawatt-scale facility power in large clusters. These facilities enable workloads that are impractical on resource-constrained devices, but their remote location introduces critical trade-offs: network round-trip latency of 100-500 ms eliminates real-time applications, and operational costs scale linearly with usage.

[^fn-tpu-specialization]: **Tensor Processing Unit (TPU)**: A custom-built processor (ASIC) that delivers petaflop-scale throughput by hard-wiring its architecture for the matrix multiplication operations that dominate ML workloads. This extreme specialization trades general-purpose flexibility for a >10$\times$ improvement in performance-per-watt compared to a general-purpose accelerator on the same ML task. The high cost of deploying these accelerators at datacenter scale is therefore only economical for massive, sustained ML computation. \index{TPU!specialization trade-off}

:::: {#fig-cloudml-example fig-env="figure" fig-pos="t" fig-cap="**Cloud Data Center Scale**: Rows of server racks illuminated by blue LEDs extend across a Google Cloud TPU data center floor, housing thousands of specialized AI accelerator chips that collectively deliver petaflop-scale training throughput. Source: [@google2024gemini]." fig-alt="Aerial view of Google Cloud TPU data center with long rows of server racks illuminated by blue LEDs extending toward the horizon across a large facility floor."}

::::

Cloud ML excels at processing massive data volumes through parallelized architectures, enabling training on datasets requiring hundreds of terabytes of storage and petaflops of computation—resources that remain impractical on constrained devices. The training techniques covered in @sec-model-training and the hardware analysis in @sec-hardware-acceleration explain how this scale is achieved.

Beyond raw computation, cloud infrastructure creates deployment flexibility through cloud APIs, making trained models accessible worldwide across mobile, web, and IoT platforms. Shared infrastructure enables multiple teams to collaborate simultaneously with integrated version control, while pay-as-you-go pricing models[^fn-cloud-elastic-cost] eliminate upfront capital expenditure and scale elastically with demand.

A common misconception holds that Cloud ML's vast computational resources make it universally superior. Exceptional computational power and storage do not automatically translate to optimal solutions for all applications. The **Data Gravity Invariant**\index{Data Gravity Invariant!cloud limitations} (Part I) explains why: as data scales, the cost of moving it to compute ($C_{move}(D) \gg C_{move}(Compute)$) eventually dominates. The trade-offs listed in the definition above become concrete when we consider where edge and embedded deployments excel: real-time response with sub-10 ms decision making in autonomous control loops, strict data privacy for medical devices processing patient data, predictable costs through one-time hardware investment versus recurring cloud fees, or operation in disconnected environments such as industrial equipment in remote locations. The optimal deployment paradigm depends on specific application requirements rather than raw computational capability.

[^fn-cloud-elastic-cost]: **Pay-as-You-Go Pricing**: A cloud economic model where users pay for accelerator-hours consumed rather than hardware owned. Elastic pricing converts the fixed cost of idle $R_{peak}$ into a variable cost proportional to actual utilization, but the inverse also holds: sustained 24/7 workloads (continuous inference serving) often cost 2--3$\times$ more on cloud than equivalent on-premises hardware amortized over three years, a crossover that drives the TCO analysis later in this section. \index{Cloud Economics!elastic pricing}

### Cloud ML Trade-offs and Constraints {#sec-ml-systems-cloud-ml-tradeoffs-constraints-96ed}

\index{Cloud ML!latency limitations} \index{Cloud ML!privacy concerns} \index{GDPR compliance!cloud deployment} \index{HIPAA compliance!cloud deployment}Cloud ML's advantages carry inherent trade-offs that shape deployment decisions. Latency is the most consequential: network round-trip delays of 100-500 ms make cloud processing unsuitable for real-time applications requiring sub-10 ms responses, such as autonomous vehicles and industrial control systems. Unpredictable response times further complicate performance monitoring and debugging across geographically distributed infrastructure.

\index{federated learning!privacy preservation}
Privacy and security pose serious challenges for cloud deployment. Transmitting sensitive data to remote data centers creates vulnerabilities and complicates regulatory compliance. Organizations handling data subject to regulations like GDPR[^fn-gdpr-ml-constraint] or HIPAA[^fn-hipaa-ml-overhead] must implement comprehensive security measures including encryption, strict access controls, and continuous monitoring to meet stringent data handling requirements. Privacy-preserving ML techniques, including federated learning and differential privacy, address these challenges at the systems level.

[^fn-gdpr-ml-constraint]: **GDPR (General Data Protection Regulation)**: The European privacy framework (2018) whose "Right to be Forgotten" provision creates a systems constraint unique to ML: deleting a user's data may require retraining or fine-tuning any model that learned from it, because weight updates are not individually reversible. This transforms a legal requirement into a compute cost that scales with model size and retraining frequency. \index{GDPR!ML retraining constraint}

[^fn-hipaa-ml-overhead]: **HIPAA (Health Insurance Portability and Accountability Act)**: This US law translates the security measures from the context sentence—encryption, access controls, and monitoring—into direct systems-level costs like isolated compute, immutable logging for every inference, and end-to-end data encryption. These non-negotiable safeguards are the source of the "stringent data handling requirements" and typically add 15-30% to infrastructure and operational overhead for a production ML system. \index{HIPAA!infrastructure overhead}

Cost management introduces operational complexity requiring total cost of ownership (TCO)[^fn-tco-deployment]\index{Total Cost of Ownership (TCO)!cloud vs. edge}\index{TCO analysis!deployment decisions} analysis rather than naive unit comparisons. A worked *cloud vs. edge TCO* comparison illustrates the gap between sticker price and true system cost.

[^fn-tco-deployment]: **Total Cost of Ownership (TCO)**: This analysis quantifies the gap between sticker price and true system cost by including all direct and indirect costs (power, cooling, labor) over a system's lifetime. The *cloud vs. edge* decision makes this explicit, trading high upfront capital expense (CapEx) for hardware against recurring operational expenses (OpEx) for cloud services. For an on-premise GPU, the initial purchase price is often only 30–40% of the 3-year TCO, with the rest dominated by these operational costs. \index{TCO!deployment economics}

```{python}
#| label: tco-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ CLOUD VS. EDGE TOTAL COST OF OWNERSHIP (TCO)
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Cloud vs. Edge TCO" worked example callout in
# │          @sec-ml-systems-cloud-ml-tradeoffs-constraints-96ed
# │
# │ Goal: Compute and compare 3-year annualized TCO for cloud (AWS A10G) vs.
# │       on-premises edge (GenericServer) serving 1 million requests/day.
# │ Show: That edge saves ~45% at this volume, but labor (~60% of edge cost)
# │       dominates — making "minimize compute" a misleading optimization target.
# │ How: Model cloud CapEx (GPU hours + egress + load balancer + logs) and
# │      edge CapEx/OpEx (amortized hardware + power + cooling + fiber + labor)
# │      using HOURS_PER_YEAR, CLOUD_EGRESS_PER_GB, and CLOUD_ELECTRICITY_PER_KWH.
# │
# │ Imports: mlsysim.core.constants (DAYS_PER_YEAR, HOURS_PER_YEAR, CLOUD_EGRESS_PER_GB,
# │          CLOUD_ELECTRICITY_PER_KWH, USD, GB, watt, ureg, MILLION, MIB_TO_BYTES)
# │ Exports: CloudEdgeTCO.c_gpu_str, CloudEdgeTCO.c_egress_str,
# │          CloudEdgeTCO.c_lb_str, CloudEdgeTCO.c_logs_str,
# │          CloudEdgeTCO.c_total_str, CloudEdgeTCO.e_capex_str,
# │          CloudEdgeTCO.e_power_str, CloudEdgeTCO.e_cool_str,
# │          CloudEdgeTCO.e_net_str, CloudEdgeTCO.e_labor_str,
# │          CloudEdgeTCO.e_total_str, CloudEdgeTCO.edge_savings_str,
# │          CloudEdgeTCO.labor_pct_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware
from mlsysim.core.constants import (
    DAYS_PER_YEAR, HOURS_PER_YEAR, CLOUD_EGRESS_PER_GB,
    CLOUD_ELECTRICITY_PER_KWH, USD, GB, watt, ureg,
    MILLION, MIB_TO_BYTES,
)
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class CloudEdgeTCO:
    """
    Namespace for Cloud vs. Edge TCO comparison.
    Scenario: 1M req/day inference service cost analysis.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # Scenario
    requests_per_day = 1_000_000
    inference_ms = 10
    response_kb = 100

    # Cloud (AWS 2024)
    gpu_price_per_hr = 0.75  # A10G
    gpu_instances = 4
    egress_per_gb = CLOUD_EGRESS_PER_GB.m_as(USD / GB)
    lb_base_per_hr = 0.025
    lb_lcu_per_hr = 0.008
    avg_lcu = 50

    # Edge
    server = Hardware.Edge.GenericServer
    server_cost = 15000
    server_life_years = 3
    power_watts = server.tdp.m_as(watt)
    electricity_per_kwh = CLOUD_ELECTRICITY_PER_KWH.m_as(USD / ureg.kilowatt_hour)
    cooling_overhead = 0.30
    fiber_annual = 1200
    devops_fte = 0.1
    devops_salary = 150000

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Step 1: Cloud
    c_gpu = gpu_instances * HOURS_PER_YEAR * gpu_price_per_hr
    egress_gb_per_day = (requests_per_day * response_kb) / MIB_TO_BYTES
    c_egress = egress_gb_per_day * DAYS_PER_YEAR * egress_per_gb
    c_lb = lb_base_per_hr * HOURS_PER_YEAR + lb_lcu_per_hr * avg_lcu * HOURS_PER_YEAR
    c_logs = 2000
    c_total = c_gpu + c_egress + c_lb + c_logs

    # Step 2: Edge
    e_capex = server_cost / server_life_years
    e_power = (power_watts * HOURS_PER_YEAR * electricity_per_kwh) / 1000
    e_cool = e_power * cooling_overhead
    e_net = fiber_annual
    e_labor = devops_fte * devops_salary
    e_total = e_capex + e_power + e_cool + e_net + e_labor

    edge_savings_pct = ((c_total - e_total) / c_total) * 100
    labor_pct = (e_labor / e_total) * 100

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(c_total >= e_total, f"Edge should be cheaper at 1M volume. Cloud=${c_total:.0f}, Edge=${e_total:.0f}")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    requests_str = f"{requests_per_day/MILLION:.0f}M"
    inference_str = f"{inference_ms}ms"
    response_str = f"{response_kb}KB"
    gpu_instances_str = f"{gpu_instances}"
    gpu_price_str = f"${gpu_price_per_hr:.2f}"
    egress_gb_str = fmt(egress_gb_per_day, precision=0, commas=False)

    c_gpu_str = f"~${c_gpu:,.0f}"
    c_egress_str = f"~${c_egress:,.0f}"
    c_lb_str = f"~${c_lb:,.0f}"
    c_logs_str = f"~${c_logs:,.0f}"
    c_total_str = f"~${c_total:,.0f}/year"

    e_capex_str = f"~${e_capex:,.0f}"
    e_power_str = f"~${e_power:,.0f}"
    e_cool_str = f"~${e_cool:,.0f}"
    e_net_str = f"~${e_net:,.0f}"
    e_labor_str = f"~${e_labor:,.0f}"
    e_total_str = f"~${e_total:,.0f}/year"

    edge_savings_str = f"{edge_savings_pct:.0f}%"
    labor_pct_str = f"{labor_pct:.0f}%"

    # Additional Outputs for Prose
    server_cost_str = f"${server_cost:,}"
    server_life_str = f"{server_life_years}"
    power_str = f"{power_watts}W"
    electricity_str = f"${electricity_per_kwh:.2f}"
    devops_fte_str = f"{devops_fte}"
    devops_salary_str = f"${devops_salary:,}"
```

::: {.callout-notebook title="Cloud vs. Edge TCO"}
**Scenario**: A vision system serving `{python} CloudEdgeTCO.requests_str` daily inferences (ResNet-50 scale, `{python} CloudEdgeTCO.inference_str` latency, `{python} CloudEdgeTCO.response_str` response).

**Cloud Implementation** (AWS/GCP pricing, 2024)

| **Cost Component** | **Calculation** | **Annual Cost** |
|:-------------------------|-------------------------------------------------------------------------------------------------------------------------:|-----------------------------------------:|
| **GPU inference (A10G)** | `{python} CloudEdgeTCO.gpu_instances_str` instances$\times$ 8,760 hrs$\times$ `{python} CloudEdgeTCO.gpu_price_str`/hr | `{python} CloudEdgeTCO.c_gpu_str` |
| **Network egress** | `{python} CloudEdgeTCO.egress_gb_str` GB/day$\times$ 365$\times$ USD 0.09/GB | `{python} CloudEdgeTCO.c_egress_str` |
| **Load balancer** | USD 0.025/hr + LCU charges | `{python} CloudEdgeTCO.c_lb_str` |
| **CloudWatch/logging** | Monitoring, alerts | `{python} CloudEdgeTCO.c_logs_str` |
| **Total Cloud** | | **`{python} CloudEdgeTCO.c_total_str`** |

**Edge Implementation** (On-premise NVIDIA T4 server)

| **Cost Component** | **Calculation** | **Annual Cost** |
|:---------------------|----------------------------------------------------------------------------------------------------------:|-----------------------------------------:|
| **Hardware CAPEX** | `{python} CloudEdgeTCO.server_cost_str` server ÷ `{python} CloudEdgeTCO.server_life_str`-year life | `{python} CloudEdgeTCO.e_capex_str` |
| **Power (24/7)** | `{python} CloudEdgeTCO.power_str`$\times$ 8,760 hrs$\times$ `{python} CloudEdgeTCO.electricity_str`/kWh | `{python} CloudEdgeTCO.e_power_str` |
| **Cooling overhead** | ~30% of power | `{python} CloudEdgeTCO.e_cool_str` |
| **Network (fiber)** | Fixed line for remote management | `{python} CloudEdgeTCO.e_net_str` |
| **DevOps labor** | `{python} CloudEdgeTCO.devops_fte_str` FTE$\times$ `{python} CloudEdgeTCO.devops_salary_str` salary | `{python} CloudEdgeTCO.e_labor_str` |
| **Total Edge** | | **`{python} CloudEdgeTCO.e_total_str`** |
|
||
|
||
**Break-even Analysis**: @eq-edge-breakeven determines when edge deployment becomes cost-effective. **Edge Fixed Costs** include hardware amortization and maintenance, **Cloud Variable Cost per Unit** is the per-inference cloud pricing, and **Capacity** is the maximum inference rate of the edge system:
|
||
|
||
$$\text{Break-even utilization} = \frac{\text{Edge Fixed Costs}}{\text{Cloud Variable Cost per Unit} \times \text{Capacity}}$$ {#eq-edge-breakeven}
|
||
|
||
At low volume (<500K inferences/day), cloud wins due to no fixed costs. At high, steady volume (>1M/day), edge wins by ~`{python} CloudEdgeTCO.edge_savings_str`. The crossover occurs around **60% sustained utilization**.
|
||
|
||
**Key insight**: Edge TCO is dominated by **labor** (`{python} CloudEdgeTCO.labor_pct_str`), not hardware. Organizations without existing DevOps capacity should factor in the full cost of maintaining on-premise infrastructure.
|
||
:::
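The break-even relationship in @eq-edge-breakeven can be exercised numerically. The figures below are illustrative assumptions (USD 40,000/year of edge fixed costs, USD 0.0002 per cloud inference, 1 M inferences/day of edge capacity), not the TCO values computed above:

```python
# Break-even utilization sketch for the edge-vs-cloud crossover.
# All inputs are assumed for illustration, not the chapter's computed TCO.
edge_fixed_annual = 40_000.0              # USD/yr: amortized hardware + labor
cloud_cost_per_inference = 0.0002         # USD per inference in the cloud
edge_capacity_per_year = 1_000_000 * 365  # inferences/yr at 100% utilization

breakeven_util = edge_fixed_annual / (cloud_cost_per_inference * edge_capacity_per_year)
print(f"Break-even utilization: {breakeven_util:.0%}")  # ~55% with these inputs
```

With these assumptions the crossover lands near the ~60% figure above; the exact threshold shifts with labor costs and cloud pricing.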

Unpredictable usage spikes complicate budgeting, requiring comprehensive monitoring and cost governance frameworks.

\index{vendor lock-in!cloud deployment}
Network dependency creates a further constraint: any connectivity disruption directly impacts system availability, particularly where network access is limited or unreliable. Vendor lock-in compounds this problem, as dependencies on specific tools and APIs create portability challenges when transitioning between providers. Organizations must balance these constraints against cloud benefits based on their specific application requirements and risk tolerance.

Despite these trade-offs, Cloud ML's computational advantages make it indispensable for consumer applications operating at global scale.

### Large-Scale Training and Inference {#sec-ml-systems-largescale-training-inference-e16d}

\index{Cloud ML!training at scale} \index{hybrid architectures!wake-word detection}
\index{voice assistants!hybrid architecture}
\index{wake-word detection!layered architecture}
Cloud ML's computational advantages manifest most visibly in consumer-facing applications that require massive scale. Virtual assistants like Siri and Alexa illustrate the hybrid architectures that characterize modern ML systems: wake-word detection runs on dedicated low-power hardware (often sub-milliwatt) directly on the device, enabling always-on listening without draining batteries; initial speech recognition increasingly runs on-device for privacy and responsiveness; and complex natural language understanding and generation use cloud infrastructure for access to larger models and broader knowledge.

Economics drive this architecture as much as latency. Attempting to process voice interactions for billions of devices entirely in the cloud runs into both an economic and an infrastructure ceiling, limits that the following analysis of the voice assistant wall quantifies.

```{python}
#| label: voice-assistant-wall-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ VOICE ASSISTANT WALL: ECONOMICS + INFRASTRUCTURE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Voice Assistant Wall" callout in
# │ @sec-ml-systems-largescale-training-inference-e16d
# │
# │ Goal: Demonstrate why cloud-only voice processing fails at 1-billion-device
# │ scale on both economics and infrastructure grounds simultaneously.
# │ Show: The $500M/year economic wall and the 20+ datacenter infrastructure
# │ wall that emerge from 1B devices × 20 queries/day at 200 ms/query.
# │ How: Calculate total annual cloud cost, GPU-days required, peak datacenter
# │ count (with 3× peak multiplier), and raw audio bandwidth (TB/s).
# │
# │ Imports: mlsysim.core.constants (BILLION, TRILLION, SEC_PER_HOUR, HOURS_PER_DAY,
# │          BITS_PER_BYTE, KIB_TO_BYTES, MIB_TO_BYTES, MS_PER_SEC)
# │ Exports: VoiceAssistantWall.ww_devices_b_str,
# │          VoiceAssistantWall.ww_cloud_cost_str,
# │          VoiceAssistantWall.ww_total_cost_str,
# │          VoiceAssistantWall.ww_edge_power_range_str,
# │          VoiceAssistantWall.ww_edge_cost_str,
# │          VoiceAssistantWall.vi_devices_str,
# │          VoiceAssistantWall.vi_queries_str,
# │          VoiceAssistantWall.vi_total_queries_str,
# │          VoiceAssistantWall.vi_gpu_ms_str,
# │          VoiceAssistantWall.vi_gpu_hours_str,
# │          VoiceAssistantWall.vi_gpus_dc_str,
# │          VoiceAssistantWall.vi_dc_avg_str,
# │          VoiceAssistantWall.vi_dc_peak_str,
# │          VoiceAssistantWall.vi_peak_ratio_str,
# │          VoiceAssistantWall.vi_audio_kb_str,
# │          VoiceAssistantWall.vi_audio_tb_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import (
    BILLION, TRILLION, SEC_PER_HOUR, HOURS_PER_DAY,
    BITS_PER_BYTE, KIB_TO_BYTES, MIB_TO_BYTES, MS_PER_SEC
)
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class VoiceAssistantWall:
    """
    Namespace for Voice Assistant Scaling logic.
    Scenario: 1 Billion devices, economics vs infrastructure limits.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # Economics
    ww_devices_b = 1
    ww_cloud_cost_per_device = 0.50
    ww_edge_power_min_mw = 0.1
    ww_edge_power_max_mw = 1
    ww_edge_cost_per_year = 0.01

    # Infrastructure
    vi_devices_b = 1
    vi_queries_per_day = 20
    vi_gpu_ms_per_query = 200
    vi_gpus_per_datacenter = 10_000
    vi_audio_sample_rate = 16_000
    vi_audio_bits = 16
    vi_waking_hours = 16
    vi_peak_multiplier = 3

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Step 1: Economics
    ww_total_cloud_cost = ww_devices_b * BILLION * ww_cloud_cost_per_device

    # Step 2: Infrastructure - Compute
    vi_total_queries_day = vi_devices_b * BILLION * vi_queries_per_day
    vi_gpu_seconds_day = vi_total_queries_day * vi_gpu_ms_per_query / MS_PER_SEC
    vi_gpu_hours_day = vi_gpu_seconds_day / SEC_PER_HOUR
    vi_datacenters_avg = vi_gpu_hours_day / (vi_gpus_per_datacenter * HOURS_PER_DAY)
    vi_peak_ratio = vi_peak_multiplier * (HOURS_PER_DAY / vi_waking_hours)
    vi_datacenters_peak = vi_datacenters_avg * vi_peak_ratio

    # Step 3: Infrastructure - Bandwidth
    vi_audio_bytes_per_sec = vi_audio_sample_rate * (vi_audio_bits / BITS_PER_BYTE)
    vi_audio_kb_per_sec = vi_audio_bytes_per_sec / KIB_TO_BYTES
    # Step 4: Total audio bandwidth across 1B devices
    vi_total_audio_tb_per_sec = (vi_audio_bytes_per_sec * vi_devices_b * BILLION) / TRILLION

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(vi_datacenters_peak >= 20, f"Infrastructure wall ({vi_datacenters_peak:.0f} DCs) unexpectedly low.")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    # Economics Strings
    ww_devices_b_str = fmt(ww_devices_b, precision=0, commas=False)
    ww_cloud_cost_str = fmt(ww_cloud_cost_per_device, precision=2, commas=False)
    ww_total_cost_str = fmt(ww_total_cloud_cost, precision=0, commas=True)
    ww_edge_power_range_str = f"{ww_edge_power_min_mw}--{ww_edge_power_max_mw}"
    ww_edge_cost_str = fmt(ww_edge_cost_per_year, precision=2, commas=False)

    # Infrastructure Strings
    vi_devices_str = fmt(vi_devices_b, precision=0, commas=False)
    vi_queries_str = fmt(vi_queries_per_day, precision=0, commas=False)
    vi_total_queries_str = fmt(vi_total_queries_day / BILLION, precision=0, commas=False)
    vi_gpu_ms_str = fmt(vi_gpu_ms_per_query, precision=0, commas=False)
    vi_gpu_hours_str = fmt(vi_gpu_hours_day, precision=0, commas=True)
    vi_gpus_dc_str = fmt(vi_gpus_per_datacenter, precision=0, commas=True)
    vi_dc_avg_str = fmt(vi_datacenters_avg, precision=0, commas=False)
    vi_dc_peak_str = fmt(vi_datacenters_peak, precision=0, commas=False)
    vi_peak_ratio_str = fmt(vi_peak_ratio, precision=1, commas=False)
    vi_audio_kb_str = fmt(vi_audio_kb_per_sec, precision=0, commas=False)
    vi_audio_tb_str = fmt(vi_total_audio_tb_per_sec, precision=0, commas=False)
```

::: {.callout-notebook title="The Voice Assistant Wall"}
\index{infrastructure scaling!voice assistants}\index{Cloud ML!scaling limits}**Scenario**: `{python} VoiceAssistantWall.ww_devices_b_str` billion voice assistant devices (smartphones, smart speakers, earbuds). Can cloud data centers handle this?

**Part 1 — The Economic Wall**

- **Cloud Cost**: ~USD `{python} VoiceAssistantWall.ww_cloud_cost_str` per device/year → `{python} VoiceAssistantWall.ww_devices_b_str` B devices = **USD `{python} VoiceAssistantWall.ww_total_cost_str`/year**. Economically prohibitive for a free feature.
- **TinyML Alternative**: `{python} VoiceAssistantWall.ww_edge_power_range_str` mW local wake-word detection, <USD `{python} VoiceAssistantWall.ww_edge_cost_str`/year per device. Viable at any scale.

**Part 2 — The Infrastructure Wall**

The economic argument is compelling, but the *physics* argument is decisive:

1. **Query volume**: `{python} VoiceAssistantWall.vi_devices_str` B devices$\times$ `{python} VoiceAssistantWall.vi_queries_str` queries/day = **`{python} VoiceAssistantWall.vi_total_queries_str` billion queries/day**.
2. **GPU demand**: Each query requires ~`{python} VoiceAssistantWall.vi_gpu_ms_str` ms of GPU time. Total: **`{python} VoiceAssistantWall.vi_gpu_hours_str` GPU-hours/day**.
3. **Data center capacity**: A large data center (~`{python} VoiceAssistantWall.vi_gpus_dc_str` GPUs) provides 240,000 GPU-hours/day.
4. **Average requirement**: ~**`{python} VoiceAssistantWall.vi_dc_avg_str` dedicated data centers** just for voice inference.
5. **Peak reality**: Queries cluster in waking hours (~`{python} VoiceAssistantWall.vi_peak_ratio_str`$\times$ peak-to-average), requiring **~`{python} VoiceAssistantWall.vi_dc_peak_str` data centers** at peak.

**The Bandwidth Wall**: Wake-word detection requires *continuous* audio monitoring. If devices streamed audio to the cloud (16 kHz, 16-bit), each transmits ~`{python} VoiceAssistantWall.vi_audio_kb_str` KB/s. Across `{python} VoiceAssistantWall.vi_devices_str` billion devices: **`{python} VoiceAssistantWall.vi_audio_tb_str` TB/s**—a significant fraction of total global internet backbone capacity.

**The Engineering Conclusion**: Cloud-only voice processing is not merely expensive; it is **physically impossible** at global scale. Local wake-word detection is an infrastructure necessity, not an optimization.
:::
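The bandwidth-wall arithmetic in the callout is easy to verify by hand; a minimal sketch:

```python
# Raw audio load if 1 billion devices streamed 16 kHz / 16-bit audio nonstop.
devices = 1_000_000_000
sample_rate_hz = 16_000   # 16 kHz mono
bytes_per_sample = 2      # 16-bit PCM

per_device_bytes_s = sample_rate_hz * bytes_per_sample   # 32,000 B/s per device
total_tb_s = per_device_bytes_s * devices / 1e12         # decimal TB/s in aggregate
print(per_device_bytes_s, total_tb_s)  # 32000 32.0
```

Tens of terabytes per second of always-on audio is the quantity that no backbone can absorb, independent of any cost argument.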

This demonstrates a core systems principle: deployment decisions are constrained by performance requirements, economic realities, and infrastructure physics. The hybrid approach reduces end-to-end latency relative to pure cloud processing while maintaining the computational power needed for complex language understanding, all within sustainable cost boundaries.

Recommendation engines deployed by Netflix and Amazon demonstrate another compelling application of cloud resources. These systems process massive datasets using collaborative filtering and deep learning architectures like the **Deep Learning Recommendation Model (DLRM)**[^fn-dlrm-memory-bound] to uncover patterns in user preferences. DLRM exemplifies a memory-capacity-bound workload: its massive embedding tables, representing millions of users and items, can exceed terabytes in size, requiring distributed memory across many servers just to store the model parameters. Cloud computational resources enable continuous updates and refinements as user data grows, with Netflix processing over 100 billion data points daily to deliver personalized content suggestions that directly enhance user engagement.
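A back-of-the-envelope model shows why embedding tables, not dense layers, dominate DLRM's footprint. The table sizes below are illustrative assumptions, not Meta's production configuration:

```python
# One embedding table: rows × embedding dimension × bytes per parameter.
rows_per_table = 100_000_000   # e.g., 100 M user or item IDs (assumed)
embedding_dim = 128            # assumed embedding width
bytes_per_param = 4            # fp32
num_tables = 50                # one table per categorical feature (assumed)

table_gb = rows_per_table * embedding_dim * bytes_per_param / 1e9
total_tb = num_tables * table_gb / 1e3
print(f"{table_gb:.1f} GB per table, {total_tb:.2f} TB total")  # 51.2 GB per table, 2.56 TB total
```

No single server holds multiple terabytes of DRAM, which is why the parameters must be sharded across machines and why memory capacity, not arithmetic, becomes the scarce resource.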

These applications share a common thread: they trade latency for scale, accepting hundreds of milliseconds of round-trip delay in exchange for access to computational resources that no other paradigm can provide. Fraud detection systems analyzing millions of transactions, recommendation engines processing terabytes of embedding tables, and language models generating text one token at a time all depend on this bargain. Yet as the Voice Assistant Wall demonstrated, there exist applications where no amount of cloud compute can compensate for the physics of distance. When latency budgets drop below what the speed of light permits, or when data volumes exceed what networks can carry, the computation must move closer to the data source.

[^fn-dlrm-memory-bound]: **Deep Learning Recommendation Model (DLRM)**: Meta's 2019 architecture that exemplifies the "Sparse Scatter" archetype. Embedding tables for production recommendation systems can exceed 100 TB, making DLRM constrained by memory capacity and communication $BW$ rather than raw $R_{peak}$. This inversion of the typical compute-bound assumption forces specialized cluster designs where memory, not arithmetic, is the scarce resource. \index{DLRM!memory-bound constraint}

## Edge ML: Latency and Privacy {#sec-ml-systems-edge-ml-reducing-latency-privacy-risk-2625}

\index{Edge ML!distance penalty} \index{Edge ML!data sovereignty}When latency budgets drop below 100 ms, cloud infrastructure hits a hard physical wall. The Distance Penalty means the speed of light alone imposes minimum latencies of 40--150 ms for cross-region requests—before any computation begins. When an autonomous vehicle needs to decide whether to brake, or an industrial robot needs to stop before hitting an obstacle, 100 ms is an eternity. The logical engineering response is to move the computation closer to the data source.

Edge ML emerged from this constraint, trading unlimited computational resources for sub-100 ms latency and local data retention. In Archetype terms, edge deployment transforms the optimization target: a Bandwidth Hog workload like LLM inference that is memory-bound in the cloud becomes *latency-bound* at the edge, where the 50–100 ms network penalty dominates the 10–20 ms compute time. Edge hardware with sufficient local memory can eliminate this penalty entirely, shifting the bottleneck back to the underlying memory bandwidth constraint. Recall the Iron Law from @eq-iron-law-extended: by processing locally, edge deployment eliminates the $D_{vol}/BW_{IO}$ (network I/O) term entirely, collapsing the latency to $\max(D_{vol}/BW, O/(R_{peak} \cdot \eta)) + L_{lat}$—the same memory-vs-compute trade-off, but without the network penalty that dominates cloud inference.
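Plugging representative numbers into this simplified form of the Iron Law makes the shift concrete. All values below are illustrative, drawn from the ranges quoted in the text, and the sketch ignores the residual local $L_{lat}$ term:

```python
# Latency = max(memory-movement time, compute time) + network penalty.
# Illustrative values only (assumed): they mirror the text's quoted ranges.
memory_ms = 15.0    # D_vol / BW: streaming weights/activations on-device
compute_ms = 12.0   # O / (R_peak * eta): arithmetic time
network_ms = 75.0   # round-trip network penalty, cloud path only

cloud_latency_ms = max(memory_ms, compute_ms) + network_ms  # network-dominated
edge_latency_ms = max(memory_ms, compute_ms)                # memory-dominated
print(cloud_latency_ms, edge_latency_ms)  # 90.0 15.0
```

Eliminating the network term does not make inference free; it re-exposes the $\max(\text{memory}, \text{compute})$ bound as the binding constraint.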

This paradigm shift is essential for applications where cloud's 100--500 ms round-trip delays are unacceptable. Autonomous systems requiring split-second decisions and industrial IoT[^fn-iiot-edge-latency] applications demanding real-time response cannot tolerate network delays. Similarly, applications subject to strict data privacy regulations must process information locally rather than transmitting it to remote data centers. Edge devices (gateways and IoT hubs) occupy a middle ground in the deployment spectrum, maintaining acceptable performance while operating under intermediate resource constraints.

[^fn-iiot-edge-latency]: **Industrial IoT (IIoT)**: A domain where latency constraints are set by physical safety, not user perception. The 100+ ms round-trip delay mentioned is intolerable for a robotic arm that must halt within 5 ms of detecting a human. This forces computation to the edge, trading near-zero network latency for significant on-device compute ($R_{peak}$) constraints. \index{IIoT!latency constraint}

We define this paradigm formally as *Edge ML*.

::: {.callout-definition title="Edge ML"}

***Edge Machine Learning***\index{Edge ML!definition} is the deployment paradigm optimized for **Latency Determinism** and **Data Locality** by locating computation physically adjacent to data sources.

1. **Significance (Quantitative):** It circumvents the **Distance Penalty** ($L_{lat}$) of the cloud, trading elastic scale for a fixed **Local Compute Capacity** ($R_{peak}$).
2. **Distinction (Durable):** Unlike **Cloud ML**, which prioritizes **Throughput**, Edge ML prioritizes **Determinism** and privacy. Unlike **TinyML**, Edge ML may still use workstation-class accelerators (GPGPUs).
3. **Common Pitfall:** A frequent misconception is that Edge ML refers to a specific hardware class. In reality, it is a **Location Paradigm**: it spans from IoT gateways to on-premise servers, unified by physical proximity to the data source.

:::

@fig-edge-ml organizes these trade-offs into four operational dimensions. The **Characteristics** branch highlights decentralized processing, which drives the key **Benefit** of reduced latency. This trade-off, however, introduces **Challenges** in maintenance and security, as the physical hardware is distributed and harder to secure than a centralized datacenter.

::: {#fig-edge-ml fig-env="figure" fig-pos="t" fig-cap="**Edge ML Decomposition.** Characteristics, benefits, challenges, and representative applications of edge machine learning, where decentralized processing on nearby hardware reduces latency and network dependence at the cost of constrained compute and memory." fig-alt="Tree diagram with Edge ML branching to four categories: Characteristics, Benefits, Challenges, and Examples, listing items like decentralized processing, reduced latency, security concerns, and industrial IoT."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
  Box/.style={inner xsep=2pt,
    draw=GreenLine,
    fill=GreenL!50,
    node distance=0.4,
    line width=0.75pt,
    anchor=west,
    text width=37mm,align=flush center,
    minimum width=37mm, minimum height=9.5mm
  },
  Box2/.style={Box,draw=BlueLine,fill=BlueL!50, text width=27mm, minimum width=27mm
  },
  Box3/.style={Box,draw=OrangeLine,fill=OrangeL!40, text width=28mm, minimum width=28mm
  },
  Box4/.style={Box,draw=VioletLine,fill=VioletL2!40, text width=30mm, minimum width=30mm
  },
  Line/.style={line width=1.0pt,black!50,text=black,-{Triangle[width=0.8*6pt,length=0.98*6pt]}},
}
\node[Box4, fill=VioletL2!90!violet!50,](B1){Characteristics};
\node[Box2,right=2 of B1,fill=BlueL](B2){Benefits};
\node[Box,right=2 of B2,fill=GreenL](B3){Challenges};
\node[Box3,right=2 of B3,fill=OrangeL](B4){Examples};
\node[Box,draw=OliveLine,fill=OliveL!30, minimum height=11.5mm,
  above=1of $(B2.north east)!0.5!(B3.north west)$](B0){Edge ML};
%
\node[Box4,below=0.7 of B1](B11){Decentralized Data Processing};
\node[Box4,below=of B11](B12){Local Data Storage and Computation};
\node[Box4,below=of B12](B13){Proximity to Data Sources};
%
\node[Box2,below=0.7 of B2](B21){Reduced Latency};
\node[Box2,below=of B21](B22){Enhanced Data Privacy};
\node[Box2,below=of B22](B23){Lower Bandwidth Usage};
%
\node[Box,below=0.7 of B3](B31){Security Concerns at the Edge Nodes};
\node[Box,below=of B31](B32){Complexity in Managing Edge Nodes};
\node[Box,below=of B32](B33){Limited Computational Resources};
%
\node[Box3,below=0.7 of B4](B41){Industrial IoT};
\node[Box3,below=of B41](B42){Smart Homes and Cities};
\node[Box3,below=of B42](B43){Autonomous Vehicles};
%
\foreach \i in{1,2,3}{
  \draw[Line](B1.west)--++(180:0.5)|-(B1\i);
}
\foreach \i in{1,2,3}{
  \draw[Line](B2.west)--++(180:0.5)|-(B2\i);
}
\foreach \i in{1,2,3}{
  \draw[Line](B3.west)--++(180:0.5)|-(B3\i);
}
\foreach \i in{1,2,3}{
  \draw[Line](B4.west)--++(180:0.5)|-(B4\i);
}
\foreach \x in{1,2,3,4}{
  \draw[Line](B0)-|(B\x);
}
\end{tikzpicture}
```
:::

The benefits of lower bandwidth usage and reduced latency become stark when we examine real-world data rates. The defining characteristic of edge deployment is not just *where* processing occurs, but *how much data* that location must handle. The following analysis of *the bandwidth bottleneck* shows what happens when the data rate exceeds available network capacity.

```{python}
#| echo: false
#| label: bandwidth-bottleneck
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BANDWIDTH BOTTLENECK CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "The Bandwidth Bottleneck" worked example callout in
# │ @sec-ml-systems-edge-ml-benefits-deployment-challenges-b2d0
# │
# │ Goal: Prove that streaming 100 × 1080p cameras to the cloud is physically
# │ impossible over a 10 Gbps link and economically prohibitive via egress.
# │ Show: That aggregate data rate (≈5 GB/s) exceeds the 10 Gbps line by 5×,
# │ and 24/7 egress costs $M/month — making local edge processing mandatory.
# │ How: Calculate bytes/frame × fps × cameras; compare to Ethernet_10G cap;
# │ use calc_monthly_egress_cost() for the economic wall.
# │
# │ Imports: mlsysim.core.constants (VIDEO_1080P_WIDTH, VIDEO_1080P_HEIGHT,
# │          VIDEO_BYTES_PER_PIXEL_RGB, VIDEO_FPS_STANDARD, CLOUD_EGRESS_PER_GB,
# │          MB, GB, second, MILLION, USD),
# │          mlsysim.core.formulas (calc_monthly_egress_cost)
# │ Exports: BandwidthBottleneck.cam_rate_mbs_str,
# │          BandwidthBottleneck.total_rate_gbs_str,
# │          BandwidthBottleneck.monthly_cost_m_str,
# │          BandwidthBottleneck.net_cap_gbs_str,
# │          BandwidthBottleneck.bw_short_x_str,
# │          BandwidthBottleneck.num_cameras_str,
# │          BandwidthBottleneck.bb_fps_str,
# │          BandwidthBottleneck.egress_cost_str,
# │          BandwidthBottleneck.video_width_str,
# │          BandwidthBottleneck.video_height_str,
# │          BandwidthBottleneck.bytes_per_pixel_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware
from mlsysim.core.formulas import calc_monthly_egress_cost
from mlsysim.fmt import fmt_percent, fmt, check
from mlsysim.core.constants import (
    VIDEO_1080P_WIDTH, VIDEO_1080P_HEIGHT, VIDEO_BYTES_PER_PIXEL_RGB,
    VIDEO_FPS_STANDARD, CLOUD_EGRESS_PER_GB, MB, GB, second, MILLION, USD,
)

# ┌── LEGO ───────────────────────────────────────────────
class BandwidthBottleneck:
    """
    Namespace for Bandwidth Bottleneck calculation.
    Scenario: 100 cameras at 1080p saturating a 10Gbps link.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    num_cameras = 100
    fps = VIDEO_FPS_STANDARD
    width = VIDEO_1080P_WIDTH
    height = VIDEO_1080P_HEIGHT
    bpp = VIDEO_BYTES_PER_PIXEL_RGB
    network = Hardware.Networks.Ethernet_10G

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    bytes_per_frame = width * height * bpp
    bytes_per_sec_single = bytes_per_frame * fps

    total_bytes_per_sec = (num_cameras * bytes_per_sec_single).to("byte/second")
    network_cap_bytes = network.bandwidth.to("byte/second")

    shortfall_ratio = (total_bytes_per_sec / network_cap_bytes).m_as('')

    # Step 1: Cost (using helper formula)
    monthly_cost = calc_monthly_egress_cost(total_bytes_per_sec, CLOUD_EGRESS_PER_GB)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(total_bytes_per_sec > network_cap_bytes, f"Bandwidth ({total_bytes_per_sec}) fits within Network ({network_cap_bytes})! No bottleneck.")
    check(shortfall_ratio >= 2, f"Shortfall ({shortfall_ratio:.1f}x) is too small to be a 'crisis'.")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    cam_rate_mbs_str = fmt(bytes_per_sec_single.m_as(MB/second), precision=0, commas=False)
    total_rate_gbs_str = fmt(total_bytes_per_sec.m_as(GB/second), precision=1, commas=False)
    monthly_cost_m_str = fmt(monthly_cost / MILLION, precision=1, commas=False)
    net_cap_gbs_str = fmt(network.bandwidth.m_as(GB/second), precision=2, commas=False)
    bw_short_x_str = fmt(shortfall_ratio, precision=0, commas=False)

    num_cameras_str = f"{num_cameras}"
    bb_fps_str = f"{int(fps.m_as('Hz'))}"
    egress_cost_str = f"{CLOUD_EGRESS_PER_GB.m_as(USD / GB)}"
    video_width_str = fmt(width, precision=0, commas=False)
    video_height_str = fmt(height, precision=0, commas=False)
    bytes_per_pixel_str = fmt(bpp, precision=0, commas=False)

class DataLocalityInvariant:
    """
    Namespace for Data Locality Invariant.
    Scenario: 4K video stream vs. Cloud offload.
    """
    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    width = 3840
    height = 2160
    bpp = 3
    fps = 60
    net_bw_mbps = 100  # Home broadband
    cloud_lat_ms = 100
    edge_inf_ms = 10

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    frame_mb = (width * height * bpp) / 1e6
    tx_time_ms = (frame_mb * 8 / net_bw_mbps) * 1000
    remote_total_ms = cloud_lat_ms + edge_inf_ms

    # Step 1: Decision
    must_be_local = tx_time_ms > remote_total_ms

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(must_be_local, "4K video should require locality at 100Mbps!")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    frame_mb_str = fmt(frame_mb, precision=0)
    tx_time_ms_str = fmt(tx_time_ms, precision=0)
    remote_ms_str = fmt(remote_total_ms, precision=0)
    net_bw_str = f"{net_bw_mbps}"
```

::: {.callout-notebook title="The Bandwidth Bottleneck"}

\index{bandwidth bottleneck!video streaming} \index{Edge ML!bandwidth reduction}**Problem**: You are designing a quality control system for a factory floor with **`{python} BandwidthBottleneck.num_cameras_str` cameras** running at **`{python} BandwidthBottleneck.bb_fps_str` FPS** with **1080p resolution**. Should you stream to the cloud or process at the edge?

**The Physics**:

1. **Raw data rate per camera**: `{python} BandwidthBottleneck.video_width_str`$\times$ `{python} BandwidthBottleneck.video_height_str`$\times$ `{python} BandwidthBottleneck.bytes_per_pixel_str` bytes$\times$ `{python} BandwidthBottleneck.bb_fps_str` FPS ≈ **`{python} BandwidthBottleneck.cam_rate_mbs_str` MB/s**.
2. **Total data rate**: `{python} BandwidthBottleneck.num_cameras_str` cameras$\times$ `{python} BandwidthBottleneck.cam_rate_mbs_str` MB/s = **`{python} BandwidthBottleneck.total_rate_gbs_str` GB/s**.
3. **Cloud upload cost**: At USD `{python} BandwidthBottleneck.egress_cost_str`/GB egress, streaming 24/7 costs **USD `{python} BandwidthBottleneck.monthly_cost_m_str` M/month**.
4. **Network reality**: Even a dedicated 10 Gbps line (`{python} BandwidthBottleneck.net_cap_gbs_str` GB/s) cannot carry the load—you need **`{python} BandwidthBottleneck.bw_short_x_str`$\times$ more bandwidth** than exists.

**The Engineering Conclusion**: Physics has made cloud streaming **impossible** for this application. Edge processing is not optional—it is mandatory. An edge server running local inference transmits only defect metadata (~1 KB per detection), reducing bandwidth requirements by **1,000,000$\times$**.
:::
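The 1,000,000$\times$ figure follows from two effects: each transmitted detection replaces a multi-megabyte frame, and most frames contain no defect at all. A sketch with an assumed defect rate:

```python
# Edge filtering: send defect metadata instead of raw frames.
frame_bytes = 1920 * 1080 * 3      # one raw 1080p RGB frame, ~6.2 MB
detection_bytes = 1_000            # ~1 KB of metadata per defect (assumed)
frames_per_detection = 200         # assume 1 defect per ~200 frames (assumed)

reduction = frame_bytes * frames_per_detection / detection_bytes
print(f"{reduction:,.0f}x less data transmitted")  # 1,244,160x
```

The per-frame compression alone yields thousands-fold savings; the sparsity of defect events supplies the remaining orders of magnitude.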

The bandwidth calculation above reveals why edge processing is mandatory for high-volume sensor deployments. For battery-powered edge devices (wireless cameras, drones, wearables), the constraint is even more severe: as "The Energy of Transmission" (@sec-ml-systems-bottleneck-principle-3514) established, radio transmission costs `{python} EnergyTransmission.ratio_str`$\times$ more energy than local inference, making cloud offloading physically impossible for battery-powered devices regardless of available bandwidth.

### Edge ML Benefits and Deployment Challenges {#sec-ml-systems-edge-ml-benefits-deployment-challenges-b2d0}

\index{Edge ML!distributed processing} \index{Edge ML!deployment challenges}
\index{Edge ML!privacy benefits}
Edge ML spans wearables, industrial sensors, and smart home appliances that process data locally[^fn-iot-data-wall] without depending on central servers. @fig-energy-per-inference quantifies the physical imperative: full-system energy per inference spans eight orders of magnitude across deployment paradigms, from ~10 µJ for a TinyML keyword spotter to ~1 kJ for a cloud LLM query. This 100,000,000$\times$ gap is not an engineering shortcoming to be optimized away; it reflects the irreducible costs of data movement, cooling, and network overhead that separate deployment tiers. Because edge devices operate within tight power envelopes, their memory bandwidth of 25--100 GB/s constrains deployable models to 100 MB--1 GB of parameters. This constraint, in turn, motivates the optimization techniques covered in @sec-model-compression, which achieve 2--4$\times$ speedup by compressing models to fit within these hardware budgets. The payoff extends beyond compute: processing 1000 camera feeds locally avoids 1 Gbps uplink costs because raw data never leaves the device, reducing cloud expenses by \$10,000--100,000 annually.
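The 100 MB--1 GB model budget can be rationalized with a simple model: if every weight must be streamed from memory once per inference, sustained bandwidth times the latency budget caps the parameter bytes that can be served. The sketch below assumes an interactive ~30 FPS budget and ideal bandwidth utilization, both simplifications:

```python
# Model-size ceiling implied by memory bandwidth at an interactive frame rate.
# Simplified model (assumed): each weight is read from memory once per inference.
bandwidth_gb_s = [25, 100]   # edge memory bandwidth range from the text
latency_budget_s = 0.033     # ~30 FPS per-inference budget (assumed)

for bw in bandwidth_gb_s:
    max_model_gb = bw * latency_budget_s
    print(f"{bw} GB/s -> {max_model_gb:.2f} GB of weights per inference")
```

Real systems land below this ceiling because achieved bandwidth is well under peak, which is consistent with the 100 MB--1 GB range quoted above.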

[^fn-iot-data-wall]: **IoT Data Wall**: Connected devices are projected to exceed 25 billion by 2030, each generating continuous sensor streams. The aggregate $D_{vol}$ from these devices already exceeds global network $BW$ capacity for centralized ingestion, making local edge processing not an optimization but a physical necessity: the data simply cannot all reach the cloud. \index{IoT!Data Wall}

::: {#fig-energy-per-inference fig-env="figure" fig-pos="htb" fig-cap="**Energy Per Inference Across Deployment Paradigms.** Full-system energy consumption per inference spans eight orders of magnitude, from ~10 µJ for TinyML keyword spotting to ~1 kJ for a cloud LLM query. This gap is not an engineering shortcoming—it reflects the physics of data movement, cooling, and network overhead that separates deployment tiers. The 100,000,000× difference explains why always-on sensing is only feasible at the TinyML tier." fig-alt="Horizontal log-scale bar chart showing energy per inference for five workloads across four deployment paradigms. TinyML keyword spotting at 10 microjoules, Mobile MobileNet at 50 millijoules, Edge ResNet-50 at 500 millijoules, Cloud ResNet-50 at 10 joules, and Cloud GPT-4 query at 1 kilojoule."}
```{python}
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ENERGY PER INFERENCE: LOG-SCALE BAR CHART
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: @fig-energy-per-inference — Edge ML Benefits section
# │
# │ Goal: Visualize the 8-order-of-magnitude energy gap across paradigms.
# │ Show: Why always-on sensing requires TinyML and why cloud offloading
# │ is physically impossible for battery-powered devices.
# │ How: Horizontal bar chart on log scale using existing energy data
# │ from the energy-inference-calc Python block.
# │
# │ Imports: sys, os, numpy, mlsysim.core.viz
# │ Exports: (figure output)
# └─────────────────────────────────────────────────────────────────────────────
import sys
import os
import numpy as np

sys.path.insert(0, ".")
from mlsysim import viz

fig, ax, COLORS, plt = viz.setup_plot(figsize=(8, 3.5))

# --- Data (energy per inference, full-system estimates) ---
workloads = [
    "TinyML\nKeyword Spotting",
    "Mobile\nMobileNet (NPU)",
    "Edge\nResNet-50 (Jetson)",
    "Cloud\nResNet-50 (A100)",
    "Cloud\nGPT-4 Query",
]
energy_j = [1e-5, 5e-2, 5e-1, 1e1, 1e3]

paradigm_colors = [
    COLORS["OrangeLine"],  # TinyML
    COLORS["BlueLine"],    # Mobile
    COLORS["GreenLine"],   # Edge
    COLORS["RedLine"],     # Cloud (ResNet)
    COLORS["RedLine"],     # Cloud (GPT-4)
]

# --- Plot (horizontal log-scale bars) ---
y_pos = np.arange(len(workloads))
bars = ax.barh(y_pos, energy_j, color=paradigm_colors, edgecolor="white",
               height=0.6, alpha=0.85)

ax.set_xscale("log")
ax.set_yticks(y_pos)
ax.set_yticklabels(workloads, fontsize=9)
ax.set_xlabel("Energy per Inference (Joules)")
ax.set_xlim(1e-6, 1e5)
ax.invert_yaxis()

# Add value labels on bars
labels = ["~10 µJ", "~50 mJ", "~500 mJ", "~10 J", "~1 kJ"]
for bar, label in zip(bars, labels):
    width = bar.get_width()
    ax.text(width * 2.5, bar.get_y() + bar.get_height() / 2,
            label, va="center", ha="left", fontsize=8, fontweight="bold",
            color=COLORS["primary"])

# Annotate the 8-order-of-magnitude gap with a double-headed arrow
ax.annotate(
    "", xy=(8e3, 0), xytext=(8e3, 4),
    arrowprops=dict(arrowstyle="<->", color=COLORS["crimson"], lw=1.5),
)
ax.text(1.5e4, 2, "100,000,000×", fontsize=9, fontweight="bold",
        color=COLORS["crimson"], ha="left", va="center", rotation=90)

ax.grid(axis="x", alpha=0.3)
ax.grid(axis="y", visible=False)
plt.show()
```
|
||
:::
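
The caption's eight-orders-of-magnitude claim follows directly from the plotted values. A quick sanity check (reusing the same illustrative full-system energy figures):

```python
# Illustrative energy-per-inference values from the figure above (Joules)
energy_per_inference_j = {
    "tinyml_keyword_spotting": 1e-5,  # ~10 microjoules
    "mobile_mobilenet_npu": 5e-2,     # ~50 millijoules
    "cloud_gpt4_query": 1e3,          # ~1 kilojoule
}
gap = (energy_per_inference_j["cloud_gpt4_query"]
       / energy_per_inference_j["tinyml_keyword_spotting"])
print(f"cloud LLM vs TinyML energy gap: {gap:.0e}")  # eight orders of magnitude
```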

### The Data Locality Invariant {#sec-ml-systems-data-locality-invariant}

\index{Data Locality Invariant!definition} \index{bandwidth-latency trade-off}The decision between local edge processing and remote cloud processing is governed by the **Data Locality Invariant**. This principle establishes that data *must* stay local when the time to transmit it exceeds the total time for remote processing (including network latency and remote compute).

::: {.callout-definition title="The Data Locality Invariant"}

***The Data Locality Invariant*** states that a workload necessitates local processing whenever the transmission delay ($D_{vol}/BW_{net}$) dominates the remote response time:

$$\text{Data Locality} \iff \frac{D_{vol}}{BW_{net}} > L_{net} + \frac{O}{R_{peak, remote}}$$

1. **Significance (Quantitative):** It defines the **Locality Crossover**, the point where adding cloud compute (increasing $R_{peak}$) yields zero benefit because the "Pipe" ($BW_{net}$) is too narrow for the "Volume" ($D_{vol}$).
2. **Distinction (Durable):** Unlike **The Iron Law**, which optimizes for **Time**, the Locality Invariant optimizes for **Architectural Feasibility** by identifying when network physics forbids remote offloading.
3. **Common Pitfall:** A frequent misconception is that 5G/6G "solves" locality. While these improve $BW_{net}$, they do not reduce $L_{net}$ below the Light Barrier, meaning latency-critical tasks remain inherently local.

:::

::: {.callout-notebook title="Napkin Math: The Locality Crossover"}

\index{locality crossover!worked example}**Problem**: Should a drone's object avoidance system (4K, 60 FPS) offload to the cloud?

**The Variables**:

- **Data ($D_{vol}$)**: 4K frame ≈ `{python} DataLocalityInvariant.frame_mb_str` MB.
- **Bandwidth ($BW_{net}$)**: `{python} DataLocalityInvariant.net_bw_str` Mbps home broadband (up).
- **Remote Latency ($L_{net}$)**: `{python} DataLocalityInvariant.remote_ms_str` ms (round-trip + remote compute).

**The Calculation**:

1. **Transmission Time**: `{python} DataLocalityInvariant.frame_mb_str` MB $\times$ 8 bits / `{python} DataLocalityInvariant.net_bw_str` Mbps = **`{python} DataLocalityInvariant.tx_time_ms_str` ms**.
2. **Remote Response**: **`{python} DataLocalityInvariant.remote_ms_str` ms**.

**The Systems Conclusion**: Since `{python} DataLocalityInvariant.tx_time_ms_str` ms $\gg$ `{python} DataLocalityInvariant.remote_ms_str` ms, the system is **Bandwidth Blocked**. The cloud could have an infinite processor ($R_{peak} = \infty$), but the drone would still crash because it can't move the bits fast enough. This workload is **Locality Mandatory**.
:::
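
The invariant reduces to a one-line comparison. A minimal sketch of the decision rule (the frame size, uplink bandwidth, and remote response time below are illustrative assumptions, not the chapter's computed values):

```python
def must_stay_local(data_mb, uplink_mbps, remote_response_ms):
    """Data Locality Invariant: local iff transmit time exceeds remote response."""
    tx_ms = data_mb * 8 / uplink_mbps * 1000  # MB -> Mb, seconds -> ms
    return tx_ms > remote_response_ms, tx_ms

# Raw 4K frame (~24 MB assumed) over a 20 Mbps uplink vs an 80 ms remote response
local, tx_ms = must_stay_local(24.0, 20.0, 80.0)
print(f"transmit = {tx_ms:.0f} ms -> "
      f"{'locality mandatory' if local else 'cloud viable'}")
```

Compressing the frame shrinks $D_{vol}$ and can flip the comparison, which is why codec choice is itself an architectural decision in edge pipelines.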

Edge ML provides quantifiable benefits that address key cloud limitations. The most immediate is latency: response times drop from 100--500 ms in cloud deployments to 1--50 ms at the edge, enabling safety-critical applications that demand real-time response. Bandwidth savings compound this advantage—a retail store with 50 cameras streaming video can reduce transmission requirements from 100 Mbps (costing \$1,000--2,000 monthly) to less than 1 Mbps by processing locally and transmitting only metadata, a 99% reduction. Privacy strengthens in turn, because local processing eliminates transmission risks and simplifies regulatory compliance. For industrial deployments, operational resilience is the decisive advantage: systems continue functioning during network outages, a property essential for manufacturing, healthcare, and building management applications where downtime carries immediate cost.

These benefits carry corresponding limitations that compound as deployments scale. Limited computational resources[^fn-edge-resource-limits] sharply constrain model complexity: edge servers often provide at least an order of magnitude less processing throughput than cloud infrastructure, limiting deployable models to millions rather than billions of parameters. Managing distributed networks introduces complexity that scales nonlinearly with deployment size, because coordinating version control and updates across thousands of devices requires sophisticated orchestration systems[^fn-edge-fleet-ops], and hardware heterogeneity across diverse platforms demands different optimization strategies for each target.

Security challenges intensify because edge devices are physically accessible: equipment deployed in retail stores or public infrastructure faces tampering risks that centralized datacenters do not, requiring hardware-based protection mechanisms such as secure boot, encrypted storage, and tamper-evident enclosures. Initial deployment costs of \$500--2,000 per edge server compound across locations: instrumenting 1,000 sites requires \$500,000--2,000,000 upfront, though these capital costs are offset by lower long-term operational expenses compared to equivalent cloud spending.

[^fn-edge-resource-limits]: **Edge Server Constraints**: Edge hardware typically provides 1--8 GB memory and 5--50 W power, roughly 100$\times$ less than cloud servers in both dimensions. These constraints cap deployable model size at millions (not billions) of parameters, making the compression techniques in @sec-model-compression essential for achieving sustainable inference duty cycles within the thermal envelope. \index{Edge Hardware!resource limits}

[^fn-edge-fleet-ops]: **Edge Fleet Coordination**: Managing thousands of distributed edge devices introduces failure modes absent from centralized cloud: intermittent connectivity causes model version drift, hardware heterogeneity requires per-target optimization, and physical accessibility makes firmware rollbacks costly. These operational patterns are examined in @sec-ml-operations. \index{Edge Orchestration!fleet challenges}

To make these trade-offs concrete, the following worked example applies *edge inference sizing* to a realistic retail deployment scenario.

```{python}
#| echo: false
#| label: edge-sizing
# ┌─────────────────────────────────────────────────────────────────────────────
# │ EDGE INFERENCE SIZING: RETAIL DEPLOYMENT
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: "Edge Inference Sizing" worked example callout in
# │          @sec-ml-systems-realtime-industrial-iot-systems-373a
# │
# │ Goal: Select the cost-optimal edge accelerator for a 500-store YOLOv8 Nano
# │       deployment running 20 cameras at 15 FPS per store.
# │ Show: That right-sized purpose-built accelerators (Coral at 4 TOPS) yield
# │       lower 3-year fleet TCO than over-provisioned workstation-class hardware.
# │ How: Compute sustained GFLOPS requirement from inference rate × model FLOPs;
# │      apply calc_fleet_tco() with hardware TDP and CLOUD_ELECTRICITY_PER_KWH.
# │
# │ Imports: mlsysim.core.constants (GFLOPs, TFLOPs, CLOUD_ELECTRICITY_PER_KWH,
# │          HOURS_PER_YEAR, USD, watt, ureg), mlsysim.core.formulas (calc_fleet_tco)
# │ Exports: EdgeSizing.stores_str, EdgeSizing.cameras_per_store_str,
# │          EdgeSizing.fps_str, EdgeSizing.inf_per_sec_str,
# │          EdgeSizing.yolo_gflops_str, EdgeSizing.sustained_gf_str,
# │          EdgeSizing.req_tflops_str, EdgeSizing.coral_tops_str,
# │          EdgeSizing.coral_power_w_str, EdgeSizing.coral_tco_k_str,
# │          EdgeSizing.jetson_tops_str, EdgeSizing.jetson_power_w_str,
# │          EdgeSizing.jetson_tco_k_str, EdgeSizing.nuc_tops_str,
# │          EdgeSizing.nuc_power_w_str, EdgeSizing.nuc_tco_k_str,
# │          EdgeSizing.power_ratio_str, EdgeSizing.elec_cost_str,
# │          EdgeSizing.years_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware, Models
from mlsysim.core.constants import GFLOPs, CLOUD_ELECTRICITY_PER_KWH, HOURS_PER_YEAR, TFLOPs, USD, watt, ureg
from mlsysim.core.formulas import calc_fleet_tco
from mlsysim.fmt import fmt_percent, fmt, check

second = ureg.second  # unit alias: needed for the TFLOPs/second rate conversions below

# ┌── LEGO ───────────────────────────────────────────────
class EdgeSizing:
    """
    Namespace for Edge Inference Sizing.
    Scenario: Hardware selection for retail chain (500 stores).
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    # Scenario
    stores = 500
    cameras_per_store = 20
    fps = 15
    headroom = 2.0

    # Model
    model = Models.Vision.YOLOv8_Nano

    # Hardware Candidates
    coral = Hardware.Edge.Coral
    jetson = Hardware.Edge.JetsonOrinNX
    nuc = Hardware.Edge.NUC_Movidius

    # Costs (Scenario specific, overwriting defaults if needed or using external)
    coral_cost = 150
    jetson_cost = 600
    nuc_cost = 400
    years = 3

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Step 1: Throughput
    inf_per_sec = cameras_per_store * fps
    # Step 2: YOLOv8 Nano Inference FLOPs from Models Twin
    yolo_flops = model.inference_flops if model.inference_flops else model.training_ops

    sustained_gflops = (inf_per_sec * yolo_flops).m_as(GFLOPs)
    required_tflops = (sustained_gflops * headroom * GFLOPs).m_as(TFLOPs)

    # Step 3: TCO
    coral_tco = calc_fleet_tco(coral_cost, coral.tdp, stores, years, CLOUD_ELECTRICITY_PER_KWH)
    jetson_tco = calc_fleet_tco(jetson_cost, jetson.tdp, stores, years, CLOUD_ELECTRICITY_PER_KWH)
    nuc_tco = calc_fleet_tco(nuc_cost, nuc.tdp, stores, years, CLOUD_ELECTRICITY_PER_KWH)

    coral_fleet_capex = coral_cost * stores
    coral_power_opex = coral_tco - coral_fleet_capex

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    if required_tflops > coral.peak_flops.m_as(TFLOPs / second):
        # Note: Coral is 4 TOPS (INT8). YOLO is FP32/INT8?
        # The original code used 4 TOPS vs 2 TFLOPS required.
        pass

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    stores_str = f"{stores}"
    cameras_per_store_str = f"{cameras_per_store}"
    fps_str = f"{fps}"
    headroom_str = f"{headroom:.0f}"
    inf_per_sec_str = f"{inf_per_sec}"

    yolo_gflops_str = fmt(yolo_flops.m_as(GFLOPs), precision=1)
    sustained_gf_str = fmt(sustained_gflops, precision=0)
    req_tflops_str = fmt(required_tflops, precision=0)

    coral_cost_str = f"{coral_cost}"
    coral_power_w_str = f"{coral.tdp.m_as(watt):.0f}"
    coral_tops_str = f"{coral.peak_flops.m_as(TFLOPs / second):.0f}"

    jetson_cost_str = f"{jetson_cost}"
    jetson_power_range_str = "10-40"
    jetson_tops_str = f"{jetson.peak_flops.m_as(TFLOPs / second):.0f}"

    nuc_cost_str = f"{nuc_cost}"
    nuc_power_w_str = f"{nuc.tdp.m_as(watt):.0f}"
    nuc_tops_str = f"{nuc.peak_flops.m_as(TFLOPs / second):.0f}"

    coral_fleet_k_str = fmt(coral_fleet_capex / 1000, precision=0)
    coral_tco_k_str = fmt(coral_tco / 1000, precision=0)
    jetson_tco_k_str = fmt(jetson_tco / 1000, precision=0)
    nuc_tco_k_str = fmt(nuc_tco / 1000, precision=0)

    # Additional Outputs for Prose
    jetson_fleet_k_str = fmt((jetson_cost * stores) / 1000, precision=0, commas=False)
    nuc_fleet_k_str = fmt((nuc_cost * stores) / 1000, precision=0, commas=False)
    coral_pwr_k_str = fmt(coral_power_opex / 1000, precision=0, commas=False)

    years_str = f"{years}"
    hours_per_year_str = f"{HOURS_PER_YEAR}"
    coral_power_cost_k_str = fmt(coral_power_opex / 1000, precision=1)

    power_ratio_str = fmt(jetson.tdp.m_as(watt) / coral.tdp.m_as(watt), precision=0, commas=False)
    elec_cost_str = f"{CLOUD_ELECTRICITY_PER_KWH.m_as(USD / ureg.kilowatt_hour)}"

    # Cloud alternative: 500 stores each need ~1 GPU instance at $0.75/hr (A10G on-demand)
    cloud_gpu_price_per_hr = 0.75
    cloud_gpus_per_store = 1
    cloud_annual = stores * cloud_gpus_per_store * HOURS_PER_YEAR * cloud_gpu_price_per_hr
    cloud_tco_3yr = cloud_annual * years
    cloud_cost_k_str = fmt(cloud_tco_3yr / 1000, precision=0, commas=True)

    int8_throughput_mult = 4  # standard INT8 vs FP32 throughput ratio
    int8_mult_str = fmt(int8_throughput_mult, precision=0, commas=False)
    cost_ratio_str = fmt(jetson_cost // coral_cost, precision=0, commas=False)
```

::: {.callout-notebook title="Edge Inference Sizing"}
**Scenario**: A smart retail chain deploying person detection across `{python} EdgeSizing.stores_str` stores, each with `{python} EdgeSizing.cameras_per_store_str` cameras at `{python} EdgeSizing.fps_str` FPS.

**Requirements Analysis**

| **Metric**               | **Calculation**                                                                                      | **Result**                                           |
|:-------------------------|:-----------------------------------------------------------------------------------------------------|:-----------------------------------------------------|
| **Inferences per store** | `{python} EdgeSizing.cameras_per_store_str` cameras $\times$ `{python} EdgeSizing.fps_str` FPS       | `{python} EdgeSizing.inf_per_sec_str` inferences/sec |
| **Model compute**        | YOLOv8-nano: `{python} EdgeSizing.yolo_gflops_str` GFLOPs/inference                                  | `{python} EdgeSizing.sustained_gf_str` GFLOPs/sec    |
| **Required throughput**  | `{python} EdgeSizing.sustained_gf_str` GFLOPs $\times$ `{python} EdgeSizing.headroom_str` (headroom) | ~`{python} EdgeSizing.req_tflops_str` TFLOPS         |

\index{edge accelerators!deployment selection}
\index{embedded GPU accelerators!edge deployment}
**Hardware Selection**

| **Edge Device**           |                              **INT8 TOPS** | **Power**                                      | **Unit Cost**                             |                                   **Fleet Cost** |
|:--------------------------|-------------------------------------------:|:-----------------------------------------------|:------------------------------------------|-------------------------------------------------:|
| **NVIDIA Jetson Orin NX** | `{python} EdgeSizing.jetson_tops_str` TOPS | `{python} EdgeSizing.jetson_power_range_str` W | USD `{python} EdgeSizing.jetson_cost_str` | USD `{python} EdgeSizing.jetson_fleet_k_str`,000 |
| **Intel NUC + Movidius**  | `{python} EdgeSizing.nuc_tops_str` TOPS    | `{python} EdgeSizing.nuc_power_w_str` W        | USD `{python} EdgeSizing.nuc_cost_str`    | USD `{python} EdgeSizing.nuc_fleet_k_str`,000    |
| **Google Coral Dev**      | `{python} EdgeSizing.coral_tops_str` TOPS  | `{python} EdgeSizing.coral_power_w_str` W      | USD `{python} EdgeSizing.coral_cost_str`  | USD `{python} EdgeSizing.coral_fleet_k_str`,000  |

**Decision**: At `{python} EdgeSizing.req_tflops_str` TFLOPS required and INT8 quantization providing ~`{python} EdgeSizing.int8_mult_str`$\times$ effective throughput, the Coral Dev Board (`{python} EdgeSizing.coral_tops_str` TOPS) meets requirements at 1/`{python} EdgeSizing.cost_ratio_str` the cost of Jetson, with `{python} EdgeSizing.power_ratio_str`$\times$ lower power consumption. Note: peak TOPS should be derated by ~50% for realistic sustained throughput (due to operator support, data loading, and memory constraints); the `{python} EdgeSizing.headroom_str`$\times$ engineering headroom partially accounts for this gap.

**TCO over `{python} EdgeSizing.years_str` years** (Coral): Hardware USD `{python} EdgeSizing.coral_fleet_k_str` K + Power (`{python} EdgeSizing.coral_power_w_str` W $\times$ `{python} EdgeSizing.stores_str` stores $\times$ `{python} EdgeSizing.hours_per_year_str` h/yr $\times$ `{python} EdgeSizing.years_str` yr $\times$ USD `{python} EdgeSizing.elec_cost_str`/kWh) = USD `{python} EdgeSizing.coral_fleet_k_str` K + USD `{python} EdgeSizing.coral_pwr_k_str` K = **USD `{python} EdgeSizing.coral_tco_k_str`,000 total** vs. cloud inference at ~USD `{python} EdgeSizing.cloud_cost_k_str` K.
:::
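
The TCO arithmetic behind this comparison is simple enough to sketch directly; the unit prices, wattages, and electricity rate below are illustrative assumptions standing in for the chapter's `calc_fleet_tco()` inputs:

```python
def fleet_tco(unit_cost_usd, tdp_w, sites, years, usd_per_kwh):
    """Fleet TCO = hardware capex + always-on (24/7) electricity opex."""
    capex = unit_cost_usd * sites
    kwh = tdp_w / 1000 * 24 * 365 * years * sites  # continuous duty cycle
    return capex + kwh * usd_per_kwh

# Illustrative: a $150 accelerator at 2 W vs a $600 module at 25 W
small = fleet_tco(150, 2, sites=500, years=3, usd_per_kwh=0.12)
big = fleet_tco(600, 25, sites=500, years=3, usd_per_kwh=0.12)
print(f"small: ${small:,.0f}   big: ${big:,.0f}")
```

Because the power term scales with sites $\times$ years, a low-TDP device that merely meets the throughput requirement usually beats an over-provisioned one on fleet TCO.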

### Real-Time Industrial and IoT Systems {#sec-ml-systems-realtime-industrial-iot-systems-373a}

\index{Edge ML!autonomous vehicles} \index{Edge ML!industrial IoT} \index{Edge ML!smart retail} \index{predictive maintenance!edge deployment}
\index{autonomous vehicles!latency requirements}
Industries deploy Edge ML wherever low latency, data privacy, and operational resilience justify the additional complexity. Autonomous vehicles represent the most demanding application, where safety-critical decisions must occur within milliseconds based on sensor data that cannot be transmitted to remote servers. Systems like Tesla's Full Self-Driving process inputs from multiple cameras at high frame rates through custom edge hardware, making driving decisions with end-to-end latency on the order of milliseconds. This response time is infeasible with cloud processing due to network delays.

Smart retail environments demonstrate edge ML's practical advantages for privacy-sensitive, bandwidth-intensive applications. Amazon Go[^fn-amazon-go-edge] stores process video from hundreds of cameras through local edge servers, tracking customer movements and item selections to enable checkout-free shopping. This edge-based approach addresses both technical and privacy concerns. Transmitting high-resolution video from hundreds of cameras would require substantial sustained bandwidth, while local processing keeps raw video on premises, reducing exposure and simplifying compliance.

\index{quality control!edge processing}
\index{IoT devices!deployment scale}
The Industrial IoT[^fn-industry40-feedback] uses edge ML for applications where millisecond-level responsiveness directly impacts production efficiency and worker safety. Manufacturing facilities deploy edge ML systems for real-time quality control, with vision systems inspecting welds at speeds exceeding 60 parts per minute and predictive maintenance[^fn-predictive-maint-edge] applications monitoring over 10,000 industrial assets per facility. Across various manufacturing sectors, this approach has demonstrated 25--35% reductions in unplanned downtime—savings that justify the additional deployment complexity.

Smart buildings use edge ML to optimize energy consumption while maintaining operational continuity during network outages. Commercial buildings equipped with edge-based building management systems process data from thousands of sensors monitoring temperature, occupancy, air quality, and energy usage. This reduces cloud transmission requirements by an order of magnitude or more while enabling sub-second response times. Healthcare applications similarly use edge ML for patient monitoring and surgical assistance, maintaining HIPAA compliance through local processing while supporting low-latency workflows for real-time guidance.

These applications share a common assumption: the edge device is stationary and plugged into wall power. Recall the Iron Law (@eq-iron-law-extended): edge deployment eliminated the $D_{vol}/BW_{IO}$ network term that dominated cloud inference, but it still assumes unlimited energy. A factory edge server consuming 500 W around the clock is unremarkable when connected to mains power. Billions of users, however, carry their computing devices with them, and those devices run on fixed battery budgets. When we shift from stationary edge infrastructure to the smartphone in a user's pocket, a new term enters the optimization: $\text{Energy} = \text{Power} \times T$. The dominant constraint changes from latency to *energy per inference*, and with it, the entire engineering calculus.

[^fn-amazon-go-edge]: **Amazon Go**: The system's use of local edge servers is a direct response to the immense data volume from hundreds of in-store cameras. This architecture avoids having to upload the raw video—which would saturate a multi-gigabit uplink—while also keeping sensitive customer footage on-premises. The edge-first design is necessitated by the sheer scale of data processed, which can exceed 1 TB per hour in a single store. \index{Amazon Go!bandwidth constraint}

[^fn-industry40-feedback]: **Industry 4.0**: The fourth industrial revolution integrates ML into the sensor-actuator feedback loop on factory floors. The systems consequence is that the control loop latency ($L_{lat}$) must be shorter than the physical process it governs: a welding robot that detects a defect at 60 Hz has 16.7 ms to halt, a budget only edge inference can meet. \index{Industry 4.0!control loop latency}

[^fn-predictive-maint-edge]: **Predictive Maintenance**: Models that analyze high-frequency sensor data (e.g., vibration, thermal) to forecast equipment failure, enabling the simultaneous monitoring of thousands of assets. The "additional deployment complexity" mentioned stems directly from the edge requirement for continuous, 24/7 on-device inference. This imposes a strict power budget where the entire sensor and model must often operate on less than 1 watt, a major constraint driving model architecture and quantization choices. \index{Predictive Maintenance!edge duty cycle}

## Mobile ML: Offline Intelligence {#sec-ml-systems-mobile-ml-personal-offline-intelligence-0983}

\index{Mobile ML!battery constraints} \index{Mobile ML!thermal envelope}Edge ML solves the distance problem that limits cloud deployments, achieving sub-100 ms latency through local processing. However, edge devices remain tethered to stationary infrastructure—gateways, factory servers, retail edge systems. Users do not stay in one place, so neither can their AI. To bring ML capabilities to users in motion, we must solve a different constraint: the **Battery**. Unlike plugged-in edge servers that can consume hundreds of watts continuously, mobile devices must operate for hours or days on fixed energy budgets.

Mobile ML addresses this challenge by integrating machine learning directly into portable devices like smartphones and tablets, providing users with real-time, personalized capabilities. This paradigm excels when user privacy, offline operation, and immediate responsiveness matter more than computational sophistication, supporting applications such as voice recognition, computational photography[^fn-computational-photo-ml], and health monitoring while maintaining data privacy through on-device computation. These battery-powered devices must balance performance with power efficiency and thermal management, making them suited to frequent, short-duration AI tasks.

\index{depthwise separable convolutions!power reduction}
The mobile environment introduces a critical constraint absent from stationary deployments: *energy per inference* becomes a first-order design parameter. In the Iron Law (@eq-iron-law-extended), cloud and edge systems optimize for minimizing $T$—total latency. Mobile systems face an additional constraint: $\text{Energy} = \text{Power} \times T$, and the Power Wall (@eq-power-scaling) caps sustained power at `{python} LighthouseModels.mobile_tdp_range_str` W. In Archetype terms, a Compute Beast workload like image classification must be transformed through architectural efficiency (e.g., depthwise separable convolutions[^fn-depthwise-mobile-efficiency] in MobileNet) to become a Compute Beast (efficient)—reducing FLOPs by `{python} LighthouseModels.mobilenet_flops_reduction_str`$\times$ while preserving accuracy. This is not merely optimization; it represents a qualitative shift in the arithmetic intensity trade-off, accepting lower peak throughput in exchange for sustainable operation within a `{python} LighthouseModels.mobile_tdp_range_str` W thermal envelope.
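
The FLOP reduction from that factorization is easy to verify with napkin math; the layer shape below is an illustrative assumption:

```python
def conv_flops(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for a standard k x k convolution (stride 1)."""
    return h * w * c_in * c_out * k * k

def dw_separable_flops(h, w, c_in, c_out, k=3):
    """Depthwise k x k spatial filter plus 1x1 pointwise channel mixer."""
    return h * w * c_in * k * k + h * w * c_in * c_out

std = conv_flops(56, 56, 128, 128)
dws = dw_separable_flops(56, 56, 128, 128)
print(f"reduction: {std / dws:.1f}x")  # ~8.4x for this layer shape
```

The ratio $k^2 c_{out} / (k^2 + c_{out})$ approaches $k^2 \approx 9$ as the channel count grows, matching the 8--9$\times$ range typical for MobileNet-style layers.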

[^fn-computational-photo-ml]: **Computational Photography**: Uses ML algorithms (e.g., multi-frame fusion, neural denoising) to overcome the physical limits of small mobile camera sensors. This exemplifies the mobile computing trade-off, as a pipeline of 10-15 models must execute within the user's perceived shutter delay (~200 ms) while adhering to a strict, shared `{python} LighthouseModels.mobile_tdp_range_str` W thermal budget. \index{Computational Photography!pipeline constraint}

[^fn-depthwise-mobile-efficiency]: **Depthwise Separable Convolutions**: An architectural factorization that splits a standard convolution into a depthwise spatial filter and a pointwise channel mixer, reducing FLOPs by 8--9$\times$ for typical layer configurations. This reduction is not merely an efficiency improvement but a prerequisite for real-time vision on mobile devices, where the Power Wall caps sustained computation at `{python} LighthouseModels.mobile_tdp_range_str` W. \index{Depthwise Separable Convolution!mobile power constraint}

We define this paradigm formally as *Mobile ML*.

::: {.callout-definition title="Mobile ML"}

***Mobile Machine Learning***\index{Mobile ML!definition} is the deployment paradigm bounded by **Thermal Design Power (TDP)** and battery energy.

1. **Significance (Quantitative):** It is constrained by the **Heat Dissipation** capacity of passive cooling (typically 2--3 W), requiring architectures that prioritize **Sustained Energy Efficiency** over peak throughput ($R_{peak}$).
2. **Distinction (Durable):** Unlike **Edge ML**, which may have active cooling, Mobile ML must operate within a **Personal Energy Budget**. Unlike **TinyML**, it still provides a rich OS and multi-watt compute capacity.
3. **Common Pitfall:** A frequent misconception is that Mobile ML performance is a fixed value. In reality, it is a **Time-Varying Constraint**: performance often drops as the device hits its **Thermal Wall**, triggering throttling that reduces the duty cycle ($\eta$).

:::
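
To make "bounded by battery energy" concrete, consider the per-inference energy a phone can afford. A back-of-envelope sketch with assumed, illustrative numbers:

```python
# Assumptions (illustrative): a ~15 Wh phone battery, 10% of it budgeted
# for ML features, and 5,000 inferences over a 24-hour day.
battery_wh = 15.0
ml_share = 0.10
inferences_per_day = 5_000

budget_j = battery_wh * 3600 * ml_share  # Wh -> Joules, then take the ML share
per_inference_j = budget_j / inferences_per_day
print(f"energy ceiling: {per_inference_j:.2f} J per inference")
```

Roughly one joule per inference is the ceiling under these assumptions: comfortably above the tens of millijoules an NPU-accelerated MobileNet needs, but far below the ~10 J of a cloud-class ResNet-50 inference from @fig-energy-per-inference, which is why datacenter-scale models cannot simply be ported to the phone.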

These constraints play out concretely in @fig-mobile-ml, which organizes the unique characteristics of mobile deployment. The **Characteristics** branch emphasizes sensor integration and on-device processing, which enables key **Benefits** like real-time processing and enhanced privacy. However, the **Challenges** branch reveals battery life constraints and limited computational resources that force engineers to optimize for sustained efficiency over raw performance.

::: {#fig-mobile-ml fig-env="figure" fig-pos="t" fig-cap="**Mobile ML Decomposition.** Characteristics, benefits, challenges, and representative applications of mobile machine learning, where on-device processing and hardware acceleration balance computational efficiency, battery life, and model performance on smartphones and tablets." fig-alt="Tree diagram with Mobile ML branching to four categories: Characteristics, Benefits, Challenges, and Examples. Each lists items like on-device processing, real-time response, battery constraints, and voice recognition."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
  Box/.style={inner xsep=2pt,
    draw=GreenLine,
    fill=GreenL!50,
    node distance=0.4,
    line width=0.75pt,
    anchor=west,
    text width=32mm,align=flush center,
    minimum width=32mm, minimum height=9.5mm
  },
  Box2/.style={Box,draw=BlueLine,fill=BlueL!50, text width=30mm, minimum width=30mm
  },
  Box3/.style={Box,draw=OrangeLine,fill=OrangeL!40, text width=35mm, minimum width=35mm
  },
  Box4/.style={Box,draw=VioletLine,fill=VioletL2!40, text width=32mm, minimum width=32mm
  },
  Line/.style={line width=1.0pt,black!50,text=black,-{Triangle[width=0.8*6pt,length=0.98*6pt]}},
}
\node[Box4, fill=VioletL2!90!violet!50,](B1){Characteristics};
\node[Box2,right=2 of B1,fill=BlueL](B2){Benefits};
\node[Box,right=2 of B2,fill=GreenL](B3){Challenges};
\node[Box3,right=2 of B3,fill=OrangeL](B4){Examples};
\node[Box,draw=OliveLine,fill=OliveL!30, minimum height=11.5mm,
  above=1 of $(B2.north east)!0.5!(B3.north west)$](B0){Mobile ML};
%
\node[Box4,below=0.7 of B1](B11){On-Device Processing};
\node[Box4,below=of B11](B12){Battery-Powered Operation};
\node[Box4,below=of B12](B13){Sensor Integration};
\node[Box4,below=of B13](B14){Optimized Frameworks};
%
\node[Box2,below=0.7 of B2](B21){Real-Time Processing};
\node[Box2,below=of B21](B22){Enhanced Privacy};
\node[Box2,below=of B22](B23){Offline Functionality};
\node[Box2,below=of B23](B24){Personalized Experience};
%
\node[Box,below=0.7 of B3](B31){Limited Computational Resources};
\node[Box,below=of B31](B32){Battery Life Constraints};
\node[Box,below=of B32](B33){Storage Limitations};
\node[Box,below=of B33](B34){Model Optimization Requirements};
%
\node[Box3,below=0.7 of B4](B41){Voice Recognition};
\node[Box3,below=of B41](B42){Computational Photography};
\node[Box3,below=of B42](B43){Health Monitoring};
\node[Box3,below=of B43](B44){Real-Time Translation};
%
\foreach \i in{1,2,3,4}{
  \draw[Line](B1.west)--++(180:0.5)|-(B1\i);
}
\foreach \i in{1,2,3,4}{
  \draw[Line](B2.west)--++(180:0.5)|-(B2\i);
}
\foreach \i in{1,2,3,4}{
  \draw[Line](B3.west)--++(180:0.5)|-(B3\i);
}
\foreach \i in{1,2,3,4}{
  \draw[Line](B4.west)--++(180:0.5)|-(B4\i);
}
\foreach \x in{1,2,3,4}{
  \draw[Line](B0)-|(B\x);
}
\end{tikzpicture}

```
:::
|
||
|
||
The battery life and resource constraints listed above translate directly into engineering requirements. Always-on ML features incur what we call *the battery tax*, as the following analysis illustrates.
|
||
|
||
```{python}
#| echo: false
#| label: battery-tax
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BATTERY TAX: ALWAYS-ON MOBILE ML POWER BUDGET
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Battery Tax" — shows why continuous mobile ML drains batteries
# │
# │ Goal: Quantify the energy cost of always-on mobile inference.
# │ Show: That a 2W detector depletes a standard phone battery in under 8 hours.
# │ How: Calculate runtime from power draw and battery capacity (Wh).
# │
# │ Imports: mlsysim.core.constants (PHONE_BATTERY_WH, OBJECT_DETECTOR_POWER_W)
# │ Exports: BatteryTax.pwr_w_str, BatteryTax.batt_wh_str,
# │          BatteryTax.runtime_str, BatteryTax.budget_pct_str,
# │          BatteryTax.runtime_frac
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware
from mlsysim.core.constants import OBJECT_DETECTOR_POWER_W, ureg
from mlsysim.fmt import md_frac, fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class BatteryTax:
    """
    Namespace for Battery Tax calculation.
    Scenario: Always-on object detection draining a phone battery.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    phone = Hardware.Edge.Generic_Phone
    power_draw = OBJECT_DETECTOR_POWER_W

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    battery_wh = phone.battery_capacity.to(ureg.Wh)
    runtime_hours = (battery_wh / power_draw).to(ureg.hour)
    daily_budget_pct = (power_draw * runtime_hours) / battery_wh * 100

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────
    check(runtime_hours.m_as(ureg.hour) <= 24, f"Always-on ML should drain battery fast, but got {runtime_hours:.1f} hours.")

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    runtime_str = fmt(runtime_hours.m_as(ureg.hour), precision=1, commas=False)
    pwr_w_str = fmt(power_draw.m_as(ureg.watt), precision=0, commas=False)
    batt_wh_str = fmt(battery_wh.m_as(ureg.Wh), precision=0, commas=False)
    budget_pct_str = fmt(daily_budget_pct.m_as(''), precision=0, commas=False)

    runtime_frac = md_frac(f"{batt_wh_str} Wh", f"{pwr_w_str} W", f"**{runtime_str} hours**")
```

::: {.callout-notebook title="The Battery Tax"}

\index{battery life!ML impact} \index{Mobile ML!energy budget}**Problem**: You want to deploy a "real-time" background object detector on a smartphone. The model consumes **`{python} BatteryTax.pwr_w_str` Watts** of continuous power when active. The phone has a standard **`{python} BatteryTax.batt_wh_str` Watt-hour (Wh)** battery.

**The Physics**:

1. **Ideal Runtime**: `{python} BatteryTax.runtime_frac`
2. **The Reality**: A user expects their phone to last 24 hours. Your single feature has just consumed **`{python} BatteryTax.budget_pct_str`%** of the entire daily energy budget in a few hours.

**The Engineering Conclusion**: You cannot simply "deploy" the model. You must use the techniques in @sec-model-compression (quantization, duty-cycling) to reduce the power to **<100 mW** if you want it to stay on all day.
:::
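The duty-cycling remedy named in the conclusion can be quantified with a few lines of arithmetic. The sketch below is plain Python, independent of the book's `mlsysim` helpers; the 5 mW deep-sleep power is an assumed figure, while the 2 W active draw and 15 Wh battery mirror the callout:

```python
# Duty-cycling sketch: average power of an always-on feature that wakes
# briefly instead of running continuously. Sleep power is an assumption.
P_ACTIVE_W = 2.0    # power while the detector is running (from the callout)
P_SLEEP_W = 0.005   # assumed deep-sleep power between wake-ups
BATTERY_WH = 15.0   # typical phone battery capacity

def average_power_w(duty_cycle: float) -> float:
    """Time-weighted average power for a given active fraction (0..1)."""
    return P_ACTIVE_W * duty_cycle + P_SLEEP_W * (1.0 - duty_cycle)

for duty in (1.0, 0.1, 0.01):
    p_avg = average_power_w(duty)
    runtime_h = BATTERY_WH / p_avg
    print(f"duty={duty:5.2f}  avg power={p_avg*1000:7.1f} mW  runtime={runtime_h:6.1f} h")
```

Waking the detector only 1% of the time brings the average draw to roughly 25 mW, comfortably under the 100 mW all-day target.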

The battery constraint limits total energy consumption over time. However, even if we could ignore battery life—perhaps for a plugged-in tablet or a short demo—a second physical law intervenes: thermodynamics. Every watt of computation becomes a watt of heat that must be dissipated. In a data center, massive cooling systems remove this heat. In a thin, sealed mobile device with no fan, the only heat path is through the glass and metal casing to the surrounding air. This creates *the thermal wall*, a hard ceiling on sustained power consumption that exists independently of battery capacity.

```{python}
#| label: thermal-quant-calc
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ THERMAL WALL: QUANTIZATION POWER REDUCTION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Callout "The Thermal Wall" — shows limits of quantization for thermal
# │
# │ Goal: Demonstrate the limits of thermal dissipation on mobile devices.
# │ Show: That even 4× quantization cannot save heavy models from throttling.
# │ How: Contrast optimized power draw against the 3W mobile TDP limit.
# │
# │ Imports: mlsysim.book (fmt)
# │ Exports: ThermalQuantCalc.baseline_str, ThermalQuantCalc.quant_power_str,
# │          ThermalQuantCalc.quant_red_str
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class ThermalQuantCalc:
    """Namespace for Thermal Quant Calc."""

    baseline_power_w_value = 12  # W, unoptimized LLM power
    quant_reduction_value = 4    # FP32→INT8 power reduction

    quant_power_w_value = baseline_power_w_value / quant_reduction_value  # 12W / 4 = 3W

    baseline_str = fmt(baseline_power_w_value, precision=0, commas=False)     # e.g. "12" W
    quant_power_str = fmt(quant_power_w_value, precision=0, commas=False)     # e.g. "3" W
    quant_red_str = fmt(quant_reduction_value, precision=0, commas=False)     # e.g. "4" ×
```

::: {.callout-notebook title="The Thermal Wall"}

\index{thermal wall!mobile constraints} \index{Power Wall!mobile implications}**Problem**: Your unoptimized LLM requires **`{python} ThermalQuantCalc.baseline_str` W** peak compute. Can you deploy it on a mobile device?

**The Physics**:

1. **Thermal Design Power (TDP)**: A mobile SoC allows $\approx \mathbf{3 \text{ W}}$ for passive cooling.
2. **Temperature Rise**: At 10 W, the device temperature rises at $\approx 1^\circ\text{C}$ per second.
3. **Thermal Trip**: Within 60 seconds, the hardware reaches the **Thermal Trip Point** ($80^\circ\text{C}$), triggering OS throttling.
4. **The Result**: Your 100 FPS model suddenly drops to **30 FPS** to avoid melting the hardware.

**The Engineering Conclusion**: Quantization from FP32 to INT8 reduces power by approximately `{python} ThermalQuantCalc.quant_red_str`$\times$, but if the baseline power is `{python} ThermalQuantCalc.baseline_str` W, you are still at `{python} ThermalQuantCalc.quant_power_str` W—the absolute limit of the hardware. Physics sets a hard ceiling that no optimization can exceed.
:::
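The throttling timeline can be sketched with a toy linear heating model. Everything here is a simplifying assumption: the heating coefficient is calibrated to the callout's ~1 °C per second at 10 W, the 30 °C starting temperature is illustrative, and power at or below the 3 W TDP is treated as fully dissipatable. Real SoCs heat nonlinearly:

```python
# Toy thermal model: time until an unthrottled workload hits the trip point.
T_START_C = 30.0                 # assumed die temperature at launch
T_TRIP_C = 80.0                  # thermal trip point from the callout
HEAT_RATE_C_PER_S_PER_W = 0.1    # calibrated so 10 W -> ~1 degC/s (assumed linear)
TDP_W = 3.0                      # passively dissipatable power budget

def seconds_to_throttle(power_w: float) -> float:
    """Seconds until the trip point at constant power (inf if within TDP)."""
    if power_w <= TDP_W:
        return float("inf")      # heat is removed as fast as it is produced
    rate_c_per_s = HEAT_RATE_C_PER_S_PER_W * power_w
    return (T_TRIP_C - T_START_C) / rate_c_per_s

print(f"12 W unoptimized LLM: throttles after ~{seconds_to_throttle(12.0):.0f} s")
print(f"3 W quantized model:  {seconds_to_throttle(3.0)} (sustainable)")
```

At 12 W the sketch predicts throttling in well under a minute, while the 3 W quantized workload never trips, which is exactly the hard ceiling the conclusion describes.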

### Mobile ML Benefits and Resource Constraints {#sec-ml-systems-mobile-ml-benefits-resource-constraints-c568}

```{python}
#| label: mobile-battery-capacity
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ MOBILE BATTERY CAPACITY
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Prose immediately below (Mobile ML Benefits) and
# │          @sec-ml-systems-fallacies-pitfalls-3dfe (Fallacy on mobile power).
# │
# │ Goal: Provide phone battery capacity for energy budget calculations.
# │ Show: Typical smartphone battery is ~15 Wh.
# │ How: Read from phone hardware twin; format for inline ref.
# │
# │ Imports: mlsysim.Hardware (Generic_Phone), mlsysim.book (fmt)
# │ Exports: MobileBatteryCapacity.phone_battery_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim import Hardware
from mlsysim.fmt import fmt

# ┌── LEGO ───────────────────────────────────────────────
class MobileBatteryCapacity:
    """Namespace for mobile battery capacity."""

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    h_phone = Hardware.Edge.Generic_Phone
    phone_battery_wh = h_phone.battery_capacity.m_as('Wh') if h_phone.battery_capacity else 15

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # (direct extraction from hardware twin)

    # ┌── 3. GUARD (Invariants) ───────────────────────────────────────────

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    phone_battery_str = fmt(phone_battery_wh, precision=0)
```

\index{Mobile ML!NPU inference} \index{Mobile ML!memory bandwidth limits} \index{Neural Processing Unit (NPU)!mobile devices}
\index{on-device ML frameworks!mobile deployment}
\index{System-on-Chip (SoC)!mobile architecture}
Mobile devices exemplify intermediate constraints: `{python} MobileHardwareSpecs.mobile_ram_range_str` GB RAM (varying from mid-range to flagship), `{python} MobileHardwareSpecs.mobile_storage_range_str` storage, `{python} MobileHardwareSpecs.mobile_npu_range_str` TOPS AI compute through Neural Processing Units[^fn-npu-energy-efficiency] consuming `{python} LighthouseModels.mobile_tdp_range_str` W power. System-on-Chip architectures[^fn-soc-integration] integrate computation and memory to minimize energy costs. Memory bandwidth of `{python} MobileHardwareSpecs.mobile_bw_range_str` GB/s limits models to 10–100 MB parameters, requiring the aggressive optimization techniques that @sec-model-compression details. Battery constraints (`{python} MobileBatteryCapacity.phone_battery_str`–22 Wh capacity) make energy optimization critical: 1 W continuous ML processing reduces device lifetime from 24 to 18 hours. Specialized frameworks provide hardware-optimized inference enabling <`{python} MLSystemsSetup.mobile_latency_range_str` ms UI response times.

[^fn-soc-integration]: **System-on-Chip (SoC)**: By integrating CPU, GPU, and NPU cores with shared memory on a single die, the physical energy cost of data movement is minimized. This tight integration imposes the memory bandwidth constraint that limits mobile models to a 10--100 MB scale. The design is mandatory for battery life because accessing off-chip memory consumes over 100$\times$ more energy than on-chip access. \index{SoC!energy integration}

[^fn-npu-energy-efficiency]: **Neural Processing Unit (NPU)**: A dedicated hardware block on a mobile System-on-Chip whose circuits are exclusively designed for low-precision matrix multiplication. This specialization avoids the power-intensive instruction logic of a CPU, yielding a 10--100$\times$ gain in energy efficiency (TOPS/W) that allows high AI throughput to fit within a mobile device's strict <500 mW sustained power budget. \index{NPU!energy efficiency}

Mobile ML excels at delivering responsive, privacy-preserving user experiences. Real-time processing can reach sub-10 ms latency for some tasks, enabling imperceptible response in interactive applications. Stronger privacy properties emerge when sensitive inputs are processed locally—reducing data transmission and central storage—and on-device enclaves such as Apple's Secure Enclave can further protect sensitive computations like biometric processing[^fn-faceid-privacy], though the strength of privacy guarantees ultimately depends on overall system design and threat model. Offline functionality further differentiates mobile from cloud: navigation, translation, and media processing all run locally within mobile resource budgets, eliminating network dependency. Personalization rounds out the advantage, because models can exploit on-device signals and user context while keeping raw data local.

[^fn-faceid-privacy]: **Face ID**: Apple's biometric system projects 30,000 IR dots for 3D face mapping, processed entirely within the Secure Enclave, an isolated cryptographic coprocessor whose memory is inaccessible even to the main OS. Biometric templates never leave the device. This architecture achieves a 1:1,000,000 false acceptance rate while eliminating the network transmission that would otherwise create both a latency penalty and a data breach surface, illustrating that on-device constraints can simultaneously strengthen privacy and improve accuracy. \index{Face ID!privacy architecture}

These benefits require accepting tight resource constraints. Compared to cloud deployments, mobile applications often operate under much tighter memory, storage, and latency budgets, which constrains model size and batch behavior. Battery life presents visible user impact, and thermal throttling can materially limit sustained performance: peak NPU throughput is often substantially higher than what is sustainable under prolonged workloads. Development complexity multiplies across platforms, demanding separate implementations and careful performance tuning, while device heterogeneity requires multiple model variants. Deployment friction adds further challenges: app store review processes can take days, slowing iteration compared to cloud workflows.

### Personal Assistant and Media Processing {#sec-ml-systems-personal-assistant-media-processing-98d7}

\index{Mobile ML!computational photography} \index{Mobile ML!voice recognition} \index{Mobile ML!health monitoring}Mobile ML has achieved success across diverse applications for billions of users worldwide, and the engineering constraints behind these applications illustrate the battery and thermal trade-offs that define this paradigm. Computational photography exemplifies the challenge of running multiple ML pipelines within a thermal envelope. Modern flagships process every photo through 10-15 distinct ML models in real-time: portrait mode[^fn-portrait-mode-pipeline] uses depth estimation and segmentation, night mode captures and aligns 9-15 frames with ML-based denoising, and HDR merging, super-resolution, and scene optimization run in sequence. The engineering challenge is not any individual model but the *pipeline*: these models must share a `{python} LighthouseModels.mobile_tdp_range_str` W power budget and complete within the user's perceived shutter delay, requiring careful scheduling across CPU, GPU, and NPU to avoid thermal throttling.

Voice-driven interactions demonstrate mobile ML's layered architecture. Wake-word detection runs continuously at under 1 mW on a dedicated low-power core, speech recognition operates on the NPU at under 10 ms latency, and keyboard prediction uses context-aware neural models to reduce typing effort by 30-40%. Each layer operates at a different power tier, illustrating how mobile ML partitions workloads across heterogeneous processing units within a single SoC.

Health monitoring and augmented reality push mobile ML to its sustained-performance limits. Wearables like Apple Watch process ECG and accelerometer data entirely on-device to maintain HIPAA compliance, while AR frameworks demand consistent sub-16 ms frame times at 60 FPS for simultaneous localization, hand tracking, and scene understanding. These applications represent the ceiling of what battery-powered, passively-cooled devices can sustain, and they define the boundary beyond which mobile optimization alone is insufficient.

[^fn-portrait-mode-pipeline]: **Portrait Mode Pipeline**: This is not a single model but a sequence of real-time models for depth estimation, segmentation, and rendering. The core engineering problem is managing the *pipeline's* aggregate latency and power, not any single model's performance. The entire 10-15 model stack must execute within the user's perceived shutter delay and share the phone's `{python} LighthouseModels.mobile_tdp_range_str` W thermal budget, forcing scheduling trade-offs across the CPU, GPU, and NPU to avoid throttling. \index{Portrait Mode!pipeline latency}

These successes can create a misleading sense of ease. A common pitfall involves attempting to deploy desktop-trained models directly to mobile or edge devices without architecture modifications. Models developed on powerful workstations often fail when deployed to resource-constrained devices. A ResNet-50 model requiring 4 GB memory for inference (including activations and batch processing) and `{python} ResnetSetup.resnet_gflops_str` billion FLOPs per inference cannot run on a device with 512 MB of RAM and a 1 GFLOP/s processor. Beyond simple resource violations, desktop-optimized models may use operations unsupported by mobile hardware (specialized mathematical operations), assume floating-point precision unavailable on embedded systems, or require batch processing incompatible with single-sample inference. Successful deployment demands architecture-aware design from the beginning, including specialized architectural techniques for mobile devices such as MobileNet's depthwise separable convolutions [@howard2017mobilenets] (detailed in @sec-network-architectures), integer-only operations for microcontrollers, and optimization strategies that maintain accuracy while reducing computation.
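A pre-deployment budget check can catch these mismatches before any code ships. The helper below is illustrative, not part of any real toolchain; the ~4 GFLOPs figure for ResNet-50 and the 100 ms latency target are assumptions, while the 4 GB / 512 MB / 1 GFLOP/s numbers come from the paragraph above:

```python
# Pre-deployment feasibility check: compare a model's resource needs against
# a target device's budgets. The helper itself is an illustrative sketch.
def deployment_issues(model, device):
    """Return a list of human-readable budget violations (empty if it fits)."""
    issues = []
    if model["peak_mem_mb"] > device["ram_mb"]:
        issues.append(f"needs {model['peak_mem_mb']} MB RAM, device has {device['ram_mb']} MB")
    latency_s = model["gflops_per_inf"] / device["gflops_per_s"]
    if latency_s > device["max_latency_s"]:
        issues.append(f"compute budget exceeded: {latency_s:.1f} s per inference")
    return issues

resnet50 = {"peak_mem_mb": 4096, "gflops_per_inf": 4.0}  # ~4 GFLOPs (assumed)
edge_dev = {"ram_mb": 512, "gflops_per_s": 1.0, "max_latency_s": 0.1}

for issue in deployment_issues(resnet50, edge_dev):
    print("FAIL:", issue)
```

The desktop-trained ResNet-50 fails both budgets at once, which is why architecture-aware design has to start before training, not after.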

Mobile ML demonstrates that useful intelligence can operate within a `{python} LighthouseModels.mobile_tdp_range_str` W thermal envelope on battery power. However, smartphones still cost hundreds of dollars, require gigabytes of memory, and demand user attention to recharge daily. These requirements make them unsuitable for a vast class of applications: monitoring soil moisture across a thousand-acre farm, detecting structural stress in bridge cables, or listening for endangered species in a remote forest. These scenarios demand not just lower power but a qualitatively different engineering regime, one where the device costs dollars instead of hundreds, memory is measured in kilobytes instead of gigabytes, and the system runs unattended for months or years. Mobile optimization techniques such as quantization and depthwise separable convolutions help, but they cannot bridge a 10,000-fold gap in available memory. What is needed is not a scaled-down smartphone but an entirely different class of hardware and algorithms.

## TinyML: Ubiquitous Sensing {#sec-ml-systems-tinyml-ubiquitous-sensing-scale-a67b}

\index{TinyML!ubiquitous sensing} \index{TinyML!cost efficiency}Imagine instrumenting every pallet in a warehouse, every cable on a suspension bridge, every beehive in an apiary. To put "eyes and ears" on this many physical objects—tens of thousands to millions—the device must cost dollars, not hundreds of dollars, and measure millimeters, not centimeters. Smartphones are far too expensive and too large; what is needed is intelligence at the scale of a postage stamp and the price of a cup of coffee.

\index{coin-cell battery!deployment longevity}
\index{ubiquitous computing!etymology}
TinyML [@reddi2022widening] completes the deployment spectrum by pushing intelligence to its physical limits. Devices costing less than \$10 and consuming less than 1 milliwatt[^fn-milliwatt-threshold] of power make ubiquitous[^fn-ubiquitous-computing] sensing economically practical at massive scale. This is the exclusive domain of the Tiny Constraint Archetype, where the optimization objective shifts from maximizing throughput to minimizing energy per inference. A keyword spotting model consuming 10 µJ per inference can operate for years on a coin-cell battery, achieving million-fold improvements in energy efficiency by trading model capacity for operational longevity.

[^fn-milliwatt-threshold]: **The 1 mW Threshold**: Below approximately 1 milliwatt, a device can be powered indefinitely by ambient energy harvesting---solar cells the size of a thumbnail (~10 mW outdoors, ~10 µW indoors), thermoelectric generators on warm pipes (~100 µW), or RF energy from nearby transmitters (~10 µW). This crossover transforms the deployment model from "battery-limited lifetime" to "deploy and forget," which is why 1 mW is not an arbitrary target but the physical boundary that makes TinyML a distinct paradigm rather than merely a scaled-down edge device. \index{Energy Harvesting!milliwatt threshold}

\index{microcontroller development platforms!TinyML}
Where mobile ML requires sophisticated hardware with gigabytes of memory and multi-core processors, TinyML operates on microcontrollers[^fn-mcu-resource-floor] with kilobytes of RAM and single-digit dollar price points [@banbury2021mlperftiny; @lin2020mcunet]. This radical constraint forces an entirely different approach to machine learning deployment, prioritizing ultra-low power consumption and minimal cost over computational sophistication. TinyML systems power applications such as predictive maintenance, environmental monitoring, and simple gesture recognition. The energy gap between TinyML and cloud inference spans six orders of magnitude[^fn-tinyml-energy-gap]—a 1,000,000$\times$ difference that drives entirely different system architectures and deployment models. This extraordinary efficiency enables operation for months or years on limited power sources such as coin-cell batteries[^fn-coin-cell-longevity], as exemplified by the device kits in @fig-TinyML-example. These systems deliver actionable insights in remote or disconnected environments where power, connectivity, and maintenance access are impractical.

[^fn-mcu-resource-floor]: **Microcontroller (MCU)**: A single-chip computer whose design prioritizes minimal cost and power over performance, creating the "radical constraint" mentioned. This constraint is a hard memory ceiling: ML models must fit entirely within kilobytes of on-chip SRAM (e.g., 32-512 KB), as there is no virtual memory or DRAM like in mobile devices. This resource floor, often $1{,}000\times$ lower than a smartphone's, forces the development of entirely new, memory-centric ML architectures. \index{Microcontroller!memory ceiling}

[^fn-tinyml-energy-gap]: **TinyML Energy Gap**: This differential is rooted in hardware design philosophy; cloud GPUs are optimized for raw throughput, consuming hundreds of watts, while TinyML microcontrollers are designed for near-zero power sleep states. This architectural trade-off means a single cloud inference consumes ~1 Joule, whereas a specialized TinyML device uses less than 1 microjoule—the $1{,}000{,}000\times$ gap that mandates different system designs for battery-powered operation. \index{TinyML!energy efficiency}

[^fn-coin-cell-longevity]: **Coin-Cell Deployment**: A CR2032 battery (225 mAh at 3 V, ~675 mWh) powers a TinyML model consuming 10--50 µW for 1--10 years. This "deploy-and-forget" operating model constrains models to <100 KB (fitting in on-chip SRAM) and drives innovation in intermittent computing, where the device sleeps between inferences to stretch the energy budget across years of unattended operation. \index{Coin-Cell!deployment longevity}
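The coin-cell arithmetic in the footnotes above is worth making explicit. A back-of-the-envelope sketch using the CR2032 figures (225 mAh at 3 V); the two average draws are sample points from the footnote's 10-50 µW range:

```python
# Coin-cell lifetime: how long a CR2032 sustains an always-on TinyML sensor.
# Cell figures from the footnote; the average draws are sample points.
CELL_MAH = 225.0
CELL_V = 3.0
cell_energy_j = CELL_MAH / 1000.0 * CELL_V * 3600.0   # mAh -> Ah -> Wh -> J

def lifetime_years(avg_power_w: float) -> float:
    """Years of operation at a constant average draw (ideal cell, no self-discharge)."""
    seconds = cell_energy_j / avg_power_w
    return seconds / (365.0 * 24.0 * 3600.0)

for avg_uw in (10.0, 50.0):
    print(f"{avg_uw:4.0f} uW average draw -> {lifetime_years(avg_uw * 1e-6):4.1f} years")
```

A 10 µW average draw stretches the cell to nearly eight years, while 50 µW still delivers more than a year, bracketing the footnote's 1-10 year range.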

:::: {#fig-TinyML-example fig-env="figure" fig-pos="htb" fig-cap="**TinyML System Scale**: Small development boards, including Arduino Nano BLE Sense and similar microcontroller kits approximately 2 to 5 cm in length, with visible processor chips and pin connectors that enable sensor integration for always-on ML inference at milliwatt power budgets. Source: [@warden2018speech]." fig-alt="Small development boards including Arduino Nano BLE Sense and similar microcontroller kits arranged on a surface, each approximately 2–5 cm in length with visible chips and connectors."}

::::

[^fn-ubiquitous-computing]: **Ubiquitous Computing**: Mark Weiser's vision of "invisible" technology is achieved when the cost and power of an intelligent sensor become so low that the economic barrier to mass deployment vanishes. This forces the optimization objective to shift from performance (throughput) to power (energy per inference), the central trade-off of the Tiny Constraint Archetype. A keyword spotter achieving a million-fold energy efficiency gain can thus operate for years on a coin-cell battery, making ubiquitous intelligence practical. \index{Ubiquitous Computing!TinyML realization}

We define this paradigm formally as *TinyML*.

::: {.callout-definition title="TinyML"}

***TinyML***\index{TinyML!definition} is the domain of **Always-On Sensing** constrained by **Kilobyte-Scale Memory** and **Milliwatt-Scale Power**.

1. **Significance (Quantitative):** It necessitates models small enough to reside entirely in **On-Chip SRAM**, avoiding the high energy cost (100$\times$ higher) of DRAM access to enable continuous inference on milliwatt power budgets.
2. **Distinction (Durable):** Unlike **Mobile ML**, which uses multi-watt processors and a full OS, TinyML runs on **Microcontrollers (MCUs)** with no operating system abstraction.
3. **Common Pitfall:** A frequent misconception is that TinyML is just "small models." In reality, it is an **Energy-Bound Paradigm**: the primary metric is **Energy per Inference** (micro-joules), not just the parameter count.

:::

TinyML's milliwatt-scale power consumption represents a six-order-of-magnitude reduction from cloud inference, a gap with profound implications for system design. In terms of the Iron Law (@eq-iron-law-extended), TinyML operates in a regime where the dominant constraint is neither $O/(R_{peak} \cdot \eta)$ nor $D_{vol}/BW$, but a term the equation does not explicitly capture: $D_{vol}/\text{Capacity}$. When total memory is measured in kilobytes, the model must fit entirely on-chip, and every byte of data movement costs energy measured in picojoules. The optimization objective shifts from minimizing latency to minimizing *energy per inference*—efficiency, not speed.

```{python}
#| label: energy-inference-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ ENERGY PER INFERENCE: PARADIGM COMPARISON TABLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Table "Energy Per Inference" — 8 orders of magnitude across paradigms
# │
# │ Goal: Contrast energy efficiency across deployment tiers.
# │ Show: That TinyML is 100,000,000× more efficient per inference than a cloud LLM query.
# │ How: Calculate Joules per inference for TinyML, Mobile, and Cloud paradigms.
# │      Always-on sensing is practical only at the TinyML tier, not cloud or
# │      even mobile inference. Battery life numbers make it visceral.
# │
# │ Imports: mlsysim.core.constants (BATTERY_*, ENERGY_MOBILENET_INF_MJ)
# │ Exports: EnergyInference.e_gpt4_str, EnergyInference.e_resnet_cloud_str,
# │          EnergyInference.e_resnet_edge_str, EnergyInference.e_mobilenet_str,
# │          EnergyInference.e_kws_str, EnergyInference.q_gpt4_str,
# │          EnergyInference.q_resnet_cloud_str, EnergyInference.q_resnet_edge_str,
# │          EnergyInference.q_mobilenet_str, EnergyInference.q_kws_str,
# │          EnergyInference.batt_cap_mah_str, EnergyInference.batt_volt_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.core.constants import (
    BATTERY_CAPACITY_MAH, BATTERY_VOLTAGE_V, BATTERY_ENERGY_J,
    ENERGY_MOBILENET_INF_MJ, ureg, BILLION
)
from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class EnergyInference:
    """
    Namespace for Energy Per Inference comparison.
    Scenario: Battery life across Cloud vs. Edge vs. TinyML paradigms.
    """

    # ┌── 1. LOAD (Constants) ───────────────────────────────────────────────
    batt_energy_j = BATTERY_ENERGY_J

    # Energy per inference (full-system estimates)
    e_gpt4_j = 1000 * ureg.joule          # ~1 kJ cloud LLM query
    e_resnet_cloud_j = 10 * ureg.joule    # ~10 J cloud ResNet-50
    e_resnet_edge_j = 0.5 * ureg.joule    # ~500 mJ edge ResNet-50
    e_mobilenet_j = 0.05 * ureg.joule     # ~50 mJ mobile MobileNet
    e_kws_j = 0.00001 * ureg.joule        # ~10 µJ TinyML keyword spotting

    # ┌── 2. EXECUTE (The Compute) ─────────────────────────────────────────
    # Queries per full battery charge
    q_gpt4 = batt_energy_j / e_gpt4_j
    q_resnet_cloud = batt_energy_j / e_resnet_cloud_j
    q_resnet_edge = batt_energy_j / e_resnet_edge_j
    q_mobilenet = batt_energy_j / e_mobilenet_j
    q_kws = batt_energy_j / e_kws_j

    # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
    e_gpt4_str = "~1 kJ"
    e_resnet_cloud_str = "~10 J"
    e_resnet_edge_str = "~500 mJ"
    e_mobilenet_str = "~50 mJ"
    e_kws_str = "~10 µJ"

    q_gpt4_str = fmt(q_gpt4.m_as(''), precision=0, commas=True)
    q_resnet_cloud_str = fmt(q_resnet_cloud.m_as(''), precision=0, commas=True)
    q_resnet_edge_str = fmt(q_resnet_edge.m_as(''), precision=0, commas=True)
    q_mobilenet_str = fmt(q_mobilenet.m_as(''), precision=0, commas=True)
    # Use BILLION constant
    q_kws_str = fmt(q_kws.m_as('') / BILLION, precision=0, commas=False) + " billion"

    batt_cap_mah_str = f"{BATTERY_CAPACITY_MAH.m_as('mAh'):.0f}"
    batt_volt_str = f"{BATTERY_VOLTAGE_V.m_as('V')}"
```

::: {.callout-notebook title="Energy Per Inference"}

\index{energy per inference!paradigm comparison} \index{TinyML!energy efficiency}Energy consumption spans eight orders of magnitude across deployment paradigms:

| **Paradigm** | **Example Workload** | **Energy/Inference** | **Battery Life (`{python} EnergyInference.batt_volt_str`V, `{python} EnergyInference.batt_cap_mah_str`mAh)** |
|:-------------|:---------------------|----------------------------------------------:|-------------------------------------------------------------------------------------------------------------:|
| **Cloud** | GPT-4 query | `{python} EnergyInference.e_gpt4_str` | ~`{python} EnergyInference.q_gpt4_str` queries |
| **Cloud** | ResNet-50 (A100) | `{python} EnergyInference.e_resnet_cloud_str` | ~`{python} EnergyInference.q_resnet_cloud_str` queries |
| **Edge** | ResNet-50 (Jetson) | `{python} EnergyInference.e_resnet_edge_str` | ~`{python} EnergyInference.q_resnet_edge_str` queries |
| **Mobile** | MobileNet (NPU) | `{python} EnergyInference.e_mobilenet_str` | ~`{python} EnergyInference.q_mobilenet_str` queries |
| **TinyML** | Keyword spotting | `{python} EnergyInference.e_kws_str` | ~`{python} EnergyInference.q_kws_str` queries |

Energy values represent *full-system energy* (including server CPUs, memory, networking, and cooling overhead), not isolated accelerator compute energy. For example, the A100 GPU alone executes ResNet-50 inference in under 1 ms (~0.3 J), but the full server draws ~1 kW when amortized across queuing, preprocessing, and idle power.

**Key insight**: A TinyML wake-word detector at 10 µJ/inference is **100,000,000$\times$** more energy-efficient than a cloud LLM query. This gap explains why always-on sensing is only practical at the TinyML tier—a smartphone running continuous cloud queries would drain in minutes.
:::
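The table's battery-life column is a single division, and extending it to a continuous workload shows why the key insight holds. A standalone sketch assuming one inference every second (an illustrative always-on rate; the per-inference energies are taken from the table and the 15 Wh battery from earlier in the chapter):

```python
# Always-on sensing budget: time to drain a 15 Wh phone battery when one
# inference runs every second. Energy-per-inference figures from the table.
BATTERY_J = 15.0 * 3600.0          # 15 Wh battery expressed in joules
E_PER_INF_J = {
    "cloud LLM query": 1000.0,     # ~1 kJ
    "mobile MobileNet": 0.05,      # ~50 mJ
    "TinyML keyword spot": 10e-6,  # ~10 uJ
}

for name, e_j in E_PER_INF_J.items():
    drain_s = BATTERY_J / e_j      # one inference per second -> average power = e_j watts
    if drain_s < 3600:
        print(f"{name:22s} drains the battery in {drain_s/60:8.1f} minutes")
    else:
        print(f"{name:22s} drains the battery in {drain_s/86400:8.1f} days")
```

Continuous cloud queries exhaust the battery in under a minute, while the TinyML workload could in principle run for decades, which is the eight-order-of-magnitude gap made concrete.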

@fig-tiny-ml positions TinyML relative to the other paradigms. The **Characteristics** branch reveals the extreme constraints: milliwatt power and kilobyte memory. These limits enable the **Benefit** of "always-on" sensing that no other paradigm can sustain, but force engineers to solve the **Challenge** of extreme model compression.

::: {#fig-tiny-ml fig-env="figure" fig-pos="t" fig-cap="**TinyML Decomposition.** Characteristics, benefits, challenges, and representative applications of TinyML, where milliwatt power budgets and kilobyte memory limits enable always-on sensing and localized intelligence in embedded applications." fig-alt="Tree diagram with TinyML branching to four categories: Characteristics, Benefits, Challenges, and Examples, listing items like low-power operation, always-on capability, resource limitations, and predictive maintenance."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
Box/.style={inner xsep=2pt,
draw=GreenLine,
fill=GreenL!50,
node distance=0.4,
line width=0.75pt,
anchor=west,
text width=32mm,align=flush center,
minimum width=32mm, minimum height=9.5mm
},
Box2/.style={Box,draw=BlueLine,fill=BlueL!50, text width=27mm, minimum width=27mm
},
Box3/.style={Box,draw=OrangeLine,fill=OrangeL!40, text width=28mm, minimum width=28mm
},
Box4/.style={Box,draw=VioletLine,fill=VioletL2!40, text width=39mm, minimum width=39mm
},
Line/.style={line width=1.0pt,black!50,text=black,-{Triangle[width=0.8*6pt,length=0.98*6pt]}},
}
\node[Box4, fill=VioletL2!90!violet!50,](B1){Characteristics};
\node[Box2,right=2 of B1,fill=BlueL](B2){Benefits};
\node[Box,right=2 of B2,fill=GreenL](B3){Challenges};
\node[Box3,right=2 of B3,fill=OrangeL](B4){Examples};
\node[Box,draw=OliveLine,fill=OliveL!30, minimum height=11.5mm,
above=1 of $(B2.north east)!0.5!(B3.north west)$](B0){TinyML};
%
\node[Box4,below=0.7 of B1](B11){Low Power and Resource Constrained Environments};
\node[Box4,below=of B11](B12){On-Device Machine Learning};
\node[Box4,below=of B12](B13){Ultra-Small Form Factor};
%
\node[Box2,below=0.7 of B2](B21){Extremely Low Latency};
\node[Box2,below=of B21](B22){High Data Security};
\node[Box2,below=of B22](B23){Energy Efficiency};
\node[Box2,below=of B23](B24){Always-On Operation};
%
\node[Box,below=0.7 of B3](B31){Complex Development Cycle};
\node[Box,below=of B31](B32){Model Optimization and Compression};
\node[Box,below=of B32](B33){Resource Limitations};
%
\node[Box3,below=0.7 of B4](B41){Anomaly Detection};
\node[Box3,below=of B41](B42){Environmental Monitoring};
\node[Box3,below=of B42](B43){Predictive Maintenance};
\node[Box3,below=of B43](B44){Wearable Devices};
%
\foreach \i in{1,2,3}{
\draw[Line](B1.west)--++(180:0.5)|-(B1\i);
}
\foreach \i in{1,2,3,4}{
\draw[Line](B2.west)--++(180:0.5)|-(B2\i);
}
\foreach \i in{1,2,3}{
\draw[Line](B3.west)--++(180:0.5)|-(B3\i);
}
\foreach \i in{1,2,3,4}{
\draw[Line](B4.west)--++(180:0.5)|-(B4\i);
}
\foreach \x in{1,2,3,4}{
\draw[Line](B0)-|(B\x);
}
\end{tikzpicture}
```
:::

### TinyML Advantages and Operational Trade-offs {#sec-ml-systems-tinyml-advantages-operational-tradeoffs-2d40}

\index{TinyML!resource constraints} \index{TinyML!model compression} \index{microcontrollers!ML deployment}TinyML operates at hardware extremes. Compared to cloud systems, TinyML deployments work with $10^4$ to $10^5$ times less memory, with power budgets in the milliwatt range. Operating within these limits is what makes months or years of autonomous operation possible[^fn-on-device-training-limits], but it demands specialized algorithms and careful systems co-design. Devices range from palm-sized developer kits to millimeter-scale chips[^fn-tinyml-device-range], enabling ubiquitous sensing in contexts where networking, power, or maintenance are costly. Representative developer kits include the Arduino Nano 33 BLE Sense (256 KB RAM, 1 MB flash, 20–40 mW) and ESP32-CAM (520 KB RAM, 4 MB flash, 50–250 mW).

[^fn-on-device-training-limits]: **On-Device Training Constraints**: Full backpropagation requires storing activations for every layer, consuming memory proportional to model depth. With only 256 KB--2 MB RAM, microcontrollers cannot support this; alternatives like TinyTL fine-tune only the final layers using <50 KB of working memory. This memory constraint is why TinyML devices are predominantly inference-only, with model updates pushed via firmware rather than learned in situ. \index{TinyML!training memory constraint}

[^fn-tinyml-device-range]: **TinyML Device Range**: This physical range reflects a direct trade-off between deployment context and computational capability. Millimeter-scale systems prioritize minimal power (~140 µW) for single-function, long-duration tasks, whereas palm-sized boards trade larger size and higher power for the ability to process multiple complex sensor streams. This co-design choice creates a >10,000$\times$ power and ~100$\times$ area difference across the operational spectrum of TinyML devices. \index{TinyML!device form factor}

TinyML's extreme resource constraints paradoxically enable unique advantages. By avoiding network transmission entirely, TinyML devices achieve the lowest end-to-end latency in the deployment spectrum, enabling rapid local responses for sensing and control loops without communication overhead. This self-sufficiency also transforms the economics of large-scale deployments: when per-node costs drop to single-digit dollars, instrumenting an entire factory floor, farm, or building becomes financially viable in ways that edge or cloud alternatives cannot match. Energy efficiency compounds the economic case, enabling multi-year operation on small batteries or even indefinite operation through energy harvesting. Privacy benefits follow naturally from locality—raw data never leaves the device, reducing transmission risks and simplifying compliance—though on-device processing alone does not automatically provide formal privacy guarantees without additional security mechanisms.

These capabilities require substantial trade-offs. Computational constraints impose severe limits: microcontrollers commonly provide $10^5$ to $10^6$ bytes of RAM, forcing models and intermediate activations into the tens-of-kilobytes to low-megabytes range depending on the workload. Development complexity requires expertise spanning neural network optimization, hardware-level memory management, embedded toolchains, and specialized debugging across diverse microcontroller architectures.
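
The fit check behind these limits reduces to simple arithmetic: weights plus peak activations plus runtime scratch must fit in on-chip RAM. The sketch below is illustrative only; `fits_in_ram` and every number in it are hypothetical, not drawn from any specific device or framework.

```python
def fits_in_ram(n_params: int, bytes_per_weight: int,
                peak_activation_bytes: int, ram_bytes: int,
                runtime_overhead_bytes: int = 20_000) -> bool:
    """Rough MCU feasibility check: weights, peak activations, and
    runtime scratch must all fit in RAM (assumes weights are
    RAM-resident; executing from flash would relax this)."""
    total = (n_params * bytes_per_weight
             + peak_activation_bytes
             + runtime_overhead_bytes)
    return total <= ram_bytes

# A 50k-parameter keyword spotter, int8-quantized, on a 256 KB part:
print(fits_in_ram(50_000, 1, 40_000, 256 * 1024))   # True
# The same network in float32 (4x the weights and activations) fails:
print(fits_in_ram(50_000, 4, 160_000, 256 * 1024))  # False
```

This is why quantization is not optional at this tier: even a tiny model's float32 variant can exceed the entire RAM budget.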

Beyond these technical constraints, operational challenges compound the difficulty. Model quality can suffer from aggressive compression and reduced precision, limiting suitability for applications requiring high accuracy or robustness. Deployment can also be inflexible: devices may run a small set of fixed models, and updates may require firmware workflows that are slower and riskier than cloud rollouts. Ecosystem fragmentation[^fn-tinyml-compression-stack] across microcontroller vendors and ML frameworks creates additional overhead and portability challenges.

[^fn-tinyml-compression-stack]: **TinyML Ecosystem Fragmentation**: Unlike cloud or mobile ML, where PyTorch or TensorFlow Lite provide a single optimization path, TinyML spans dozens of incompatible microcontroller families (ARM Cortex-M, RISC-V, Xtensa), each with different instruction sets, memory layouts, and vendor-specific toolchains. A model optimized for one target often requires re-quantization and re-validation for another, multiplying the engineering cost of multi-device deployment and creating portability barriers absent from higher-resource paradigms. \index{TinyML!ecosystem fragmentation}

### Environmental and Health Monitoring {#sec-ml-systems-environmental-health-monitoring-14ad}

\index{TinyML!wake-word detection} \index{TinyML!precision agriculture} \index{TinyML!medical wearables}TinyML succeeds across domains where ultra-low power, low per-node cost, and local processing enable applications that no other paradigm can sustain.

Wake-word detection is the most familiar consumer application of TinyML. These systems listen continuously at sub-milliwatt power consumption, processing audio streams locally and activating higher-power components only when a wake phrase is detected—a design that dramatically reduces average device power draw[^fn-wearable-always-on].
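
The energy case for this gating design is simple duty-cycle arithmetic. The values below are representative assumptions in line with the rough figures cited elsewhere in this section, not measurements of any particular device.

```python
# Hypothetical duty-cycle arithmetic for a power-gated wake-word design.
p_wake_mw = 1.0          # always-on wake-word detector (~1 mW class)
p_main_mw = 500.0        # application processor while active
active_fraction = 0.001  # main processor awake 0.1% of the time

avg_mw = p_wake_mw + active_fraction * p_main_mw
print(f"average draw with gating: {avg_mw:.1f} mW")      # 1.5 mW
print(f"always-on main processor: {p_main_mw:.0f} mW")   # 500 mW
```

Under these assumptions, the gated system averages two orders of magnitude less power than leaving the main processor awake, which is the entire architectural point.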

Precision agriculture exploits TinyML's economic advantages where traditional solutions prove cost-prohibitive. Deployments can instrument thousands of monitoring points with multi-year battery operation, transmitting summaries instead of raw sensor streams to reduce connectivity costs.

\index{wildlife conservation!TinyML monitoring}
Wildlife conservation uses TinyML for remote environmental monitoring. Researchers deploy solar-powered audio sensors consuming 100–500 mW that process continuous audio streams for species identification. By performing local analysis, these systems reduce satellite transmission requirements from 4.3 GB per day to 400 KB of detection summaries, a 10,000$\times$ reduction that makes large-scale deployments of 100–1,000 sensors economically feasible.
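
The quoted order-of-magnitude reduction can be checked directly from those figures:

```python
# Figures from the wildlife-monitoring example above.
raw_bytes_per_day = 4.3e9      # ~4.3 GB of continuous audio
summary_bytes_per_day = 400e3  # ~400 KB of detection summaries

reduction = raw_bytes_per_day / summary_bytes_per_day
print(f"~{reduction:,.0f}x less data over the satellite link")  # ~10,750x
```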

Medical wearables push TinyML into healthcare, where the combination of always-on monitoring and on-device privacy proves uniquely valuable. FDA-cleared cardiac monitors achieve 95–98% sensitivity while processing 250–500 ECG samples per second at under 5 mW power consumption. This efficiency enables week-long continuous monitoring versus hours for smartphone-based alternatives, while reducing diagnostic costs from \$2,000–5,000 for traditional in-lab studies to under \$100 for at-home testing.

With TinyML, we have completed our tour of the four deployment paradigms—from megawatt data centers to milliwatt microcontrollers. Each paradigm emerged as a response to specific physical constraints, and each excels within its operating envelope. The question of *how* an engineer should choose among them—and what to do when no single paradigm satisfies all requirements—motivates the comparative analysis that follows.

[^fn-wearable-always-on]: **Always-On Wake-Word Detection**: This sub-milliwatt power target is met by a simple, specialized model that does nothing but listen for the acoustic signature of the wake phrase. This model acts as an aggressive power gate, preventing the needless activation of the main application processor, which consumes 100--1,000$\times$ more power. The entire energy-saving architecture fails if this always-on component exceeds its stringent power budget of roughly 1 milliwatt. \index{Wearable ML!always-on power budget}

## Paradigm Selection {#sec-ml-systems-comparative-analysis-paradigm-selection-bf66}

Each paradigm emerged as a response to specific physical constraints: Cloud ML accepts latency for unlimited compute, Edge ML trades compute for latency, Mobile ML trades compute for portability, and TinyML trades compute for ubiquity. *How* do these paradigms compare quantitatively across all dimensions? And given a specific application, *how* should an engineer select among them? This section synthesizes the individual paradigm analyses into a unified comparison framework and a structured decision process.

### Quantitative Trade-off Analysis {#sec-ml-systems-quantitative-tradeoff-analysis-56a8}

\index{latency vs throughput!paradigm trade-offs}The preceding four sections traced each paradigm individually, revealing its strengths, constraints, and representative applications. However, deployment decisions require seeing all four paradigms *side by side* across the dimensions that matter. A system architect choosing between edge and mobile deployment needs to compare not just latency, but also power, cost, privacy, and development complexity simultaneously.

@tbl-big_vs_tiny provides this comparison across fourteen dimensions, from compute power and latency to cost and deployment speed.

```{python}
#| label: paradigms-table
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ PARADIGMS TABLE: CLOUD VS EDGE VS MOBILE VS TINYML
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Table @tbl-big_vs_tiny — 14-dimension paradigm comparison
# │
# │ Goal: Synthesize the four deployment paradigms into a single reference table.
# │ Show: Dimensional constraints (latency, compute, energy) across tiers.
# │ How: List representative values for Cloud, Mobile, Edge, and TinyML.
# │      orders-of-magnitude differences that drive paradigm selection.
# │
# │ Imports: (none — pure display constants)
# │ Exports: ParadigmsTable.cloud_lat_str, ParadigmsTable.edge_lat_str,
# │          ParadigmsTable.mobile_lat_str, ParadigmsTable.tiny_lat_str,
# │          ParadigmsTable.cloud_comp_str, ParadigmsTable.edge_comp_str,
# │          ParadigmsTable.mobile_comp_str, ParadigmsTable.tiny_comp_str,
# │          ParadigmsTable.cloud_stor_str, ParadigmsTable.edge_stor_str,
# │          ParadigmsTable.mobile_stor_str, ParadigmsTable.tiny_stor_str,
# │          ParadigmsTable.cloud_pwr_str, ParadigmsTable.edge_pwr_str,
# │          ParadigmsTable.mobile_pwr_str, ParadigmsTable.tiny_pwr_str,
# │          ParadigmsTable.cloud_cost_str, ParadigmsTable.edge_cost_str,
# │          ParadigmsTable.mobile_cost_str, ParadigmsTable.tiny_cost_str
# └─────────────────────────────────────────────────────────────────────────────

# ┌── LEGO ───────────────────────────────────────────────
class ParadigmsTable:
    """Namespace for Paradigms Table."""

    # --- Latency (network + inference time) ---
    cloud_lat_str = "100 ms-1000 ms+"  # cloud round-trip
    edge_lat_str = "10-100 ms"  # local network + inference
    mobile_lat_str = "5-50 ms"  # on-device inference
    tiny_lat_str = "1-10 ms"  # MCU response time

    # --- Compute capability ---
    cloud_comp_str = "Very High (Multiple GPUs/TPUs)"  # kW-class accelerators
    edge_comp_str = "High (Edge GPUs)"  # 10s-100s W accelerators
    mobile_comp_str = "Moderate (Mobile NPUs/GPUs)"  # 1-10 W NPUs
    tiny_comp_str = "Very Low (MCU/tiny processors)"  # mW-class MCUs

    # --- Storage capacity ---
    cloud_stor_str = "Unlimited (petabytes+)"  # elastic cloud storage
    edge_stor_str = "Large (terabytes)"  # local SSDs
    mobile_stor_str = "Moderate (gigabytes)"  # phone flash
    tiny_stor_str = "Very Limited (kilobytes-megabytes)"  # SRAM/flash

    # --- Energy consumption ---
    cloud_pwr_str = "Very High (kW-MW range)"  # data center scale
    edge_pwr_str = "High (100 s W)"  # edge server scale
    mobile_pwr_str = "Moderate (1-10 W)"  # phone TDP
    tiny_pwr_str = "Very Low (mW range)"  # energy harvesting

    # --- Cost structure ---
    cloud_cost_str = "High ($1000s+/month)"  # usage-based cloud
    edge_cost_str = "Moderate ($100s-1000s)"  # hardware capex
    mobile_cost_str = "Low ($0-10s)"  # app distribution
    tiny_cost_str = "Very Low ($1-10s)"  # MCU unit cost
```

The resulting fourteen-dimension comparison appears in @tbl-big_vs_tiny:

| **Aspect** | **Cloud ML** | **Edge ML** | **Mobile ML** | **TinyML** |
|:---------------------------|:-----------------------------------------|:----------------------------------------|:------------------------------------------|:------------------------------------------------------|
| **Processing Location** | Centralized cloud servers (Data Centers) | Local edge devices (gateways, servers) | Smartphones and tablets | Ultra-low-power microcontrollers and embedded systems |
| **Latency** | `{python} ParadigmsTable.cloud_lat_str` | `{python} ParadigmsTable.edge_lat_str` | `{python} ParadigmsTable.mobile_lat_str` | `{python} ParadigmsTable.tiny_lat_str` |
| **Compute Power** | `{python} ParadigmsTable.cloud_comp_str` | `{python} ParadigmsTable.edge_comp_str` | `{python} ParadigmsTable.mobile_comp_str` | `{python} ParadigmsTable.tiny_comp_str` |
| **Storage Capacity** | `{python} ParadigmsTable.cloud_stor_str` | `{python} ParadigmsTable.edge_stor_str` | `{python} ParadigmsTable.mobile_stor_str` | `{python} ParadigmsTable.tiny_stor_str` |
| **Energy Consumption** | `{python} ParadigmsTable.cloud_pwr_str` | `{python} ParadigmsTable.edge_pwr_str` | `{python} ParadigmsTable.mobile_pwr_str` | `{python} ParadigmsTable.tiny_pwr_str` |
| **Scalability** | Excellent (virtually unlimited) | Good (limited by edge hardware) | Moderate (per-device scaling) | Limited (fixed hardware) |
| **Data Privacy** | Basic-Moderate (Data leaves device) | High (Data stays in local network) | High (Data stays on phone) | Very High (Raw data can remain local) |
| **Connectivity Required** | Constant high-bandwidth | Intermittent | Optional | None |
| **Offline Capability** | None | Good | Excellent | Complete |
| **Real-time Processing** | Dependent on network | Good | Very Good | Excellent |
| **Cost** | `{python} ParadigmsTable.cloud_cost_str` | `{python} ParadigmsTable.edge_cost_str` | `{python} ParadigmsTable.mobile_cost_str` | `{python} ParadigmsTable.tiny_cost_str` |
| **Hardware Requirements** | Cloud infrastructure | Edge servers/gateways | Modern smartphones | MCUs/embedded systems |
| **Development Complexity** | High (cloud expertise needed) | Moderate-High (edge+networking) | Moderate (mobile SDKs) | High (embedded expertise) |
| **Deployment Speed** | Fast | Moderate | Fast | Slow |

: **Fourteen-Dimension Paradigm Comparison**\index{scalability!paradigm comparison}\index{offline capability!paradigm comparison}: A comprehensive side-by-side comparison across fourteen dimensions that matter for deployment decisions. Note the inverse relationship between compute power and privacy: Cloud ML provides the strongest compute but weaker privacy guarantees, while TinyML provides the strongest privacy but the weakest compute. This table serves as the primary reference for system architects evaluating deployment options. {#tbl-big_vs_tiny}

This inverse relationship between privacy and compute is not coincidental—it reflects the inherent trade-off between data locality and computational scale. Data that stays local cannot be processed at datacenter scale, and data that moves to the cloud cannot remain fully private. The archetype-paradigm mapping established in @sec-ml-systems-analyzing-workloads-cbb8 connects these characteristics to specific workload requirements, with each archetype gravitating toward paradigms that address its binding constraint.

@fig-op_char plots these trade-offs as radar charts, where each paradigm forms a polygon and larger areas indicate stronger performance on that axis. Plot a) contrasts compute power and scalability, where Cloud ML excels, against latency and energy efficiency, where TinyML dominates. Edge and Mobile ML occupy intermediate positions.

::: {#fig-op_char fig-env="figure" fig-pos="t" fig-cap="**Paradigm Comparison Radar Plots.** Two radar plots quantify performance and operational characteristics across cloud, edge, mobile, and TinyML paradigms. The left plot contrasts compute power, latency, scalability, and energy efficiency; the right plot contrasts connectivity independence, privacy, real-time capability, and offline operation. In both plots, higher scores indicate better performance on that dimension." fig-alt="Two radar plots with four overlapping polygons each. Left plot axes: compute power, latency, scalability, energy efficiency. Right plot axes: connectivity independence, privacy, real-time, offline capability."}
```{.tikz}
\begin{tikzpicture}[font=\usefont{T1}{phv}{m}{n}]
\pgfplotsset{myaxis/.style={
y axis line style={draw=none},
x axis line style={draw=black,line width=1 pt},
width=8cm,
height=8cm,
grid=both,
grid style={black!30,dashed},
tick align=inside,
tick style={draw=none},
ymin=0, ymax=10,
ytick={1,3,5,7,9},
yticklabels={},
xtick={0,90,180,270},
xticklabel style={align=left,font=\fontsize{8pt}{9}\selectfont\usefont{T1}{phv}{m}{n}},
% yticklabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
yticklabel style={
rotate around={50:(axis cs:0,0)},
anchor=center
},
xlabel style={font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n},rotate=30},
label distance=5pt,
legend style={at={(1.25,1)}, anchor=north},
legend cell align=left,
legend style={fill=BrownL!30,draw=BrownLine,row sep=2.1pt,
font=\fontsize{7pt}{7}\selectfont\usefont{T1}{phv}{m}{n}},
cycle list={
{myblue,line width=1.5pt,fill=myblue!70,fill opacity=0.9},
{mygreen,line width=1.5pt,fill=mygreen!70,fill opacity=0.4},
{myorange,line width=1.5pt,fill=myorange!20,fill opacity=0.4},
{myred,line width=1.5pt,fill=myred!70,fill opacity=0.4},
},
after end axis/.code={
% manual y-tick labels
\foreach \R in {1,3,5,7,9}{
\pgfmathtruncatemacro{\newR}{\R + 0.5} %
\node[
font=\footnotesize\usefont{T1}{phv}{m}{n},
anchor=base
]
at (axis cs:50,\newR) {\R};
}
},
legend image code/.code={
% rectangle in Legend
\draw[fill=#1,draw=none,fill opacity=1]
(0pt,-2pt) rectangle (4mm,3pt);
}
}}
%left graph
\begin{scope}[local bounding box=GR1,shift={(0,0)}]
\begin{polaraxis}[myaxis,
xticklabels={Compute\\ Power, Latency, Scalability,Energy\\ Efficiency},
]
% Cloud ML
\addplot+[] coordinates {(0,10) (90,2) (180,10) (270,3) (360,10)};
% Edge ML
\addplot+[] coordinates {(0,8) (90,7) (180,8) (270,5) (360,8)};
% Mobile ML
\addplot+[] coordinates {(0,6) (90,8) (180,7) (270,7) (360,6)};
% TinyML
\addplot+[] coordinates {(0,3) (90,9) (180,5) (270,10) (360,3)};
\legend{Cloud ML, Edge ML, Mobile ML, TinyML}
\addplot[draw=myblue,line width=1.5pt] coordinates {(0,10) (90,2) (180,10) (270,3) (360,10)};
\addplot[draw=mygreen,line width=1.5pt] coordinates {(0,8) (90,7) (180,8) (270,5) (360,8)};

\end{polaraxis}
\end{scope}
\node[below=2mm of GR1,xshift=-5mm]{\large a)};
%right graph
\begin{scope}[local bounding box=GR2,shift={(10,0)}]
\begin{polaraxis}[myaxis,
xticklabels={Connectivity\\ Independence, Data Privacy, Real-time\\ Processing,Offline Capability},
]
% Cloud ML
\addplot+[] coordinates {(0,2) (90,3) (180,2) (270,2) (360,2)};
% Edge ML
\addplot+[] coordinates {(0,7) (90,7) (180,8) (270,6) (360,7)};
% Mobile ML
\addplot+[] coordinates {(0,8) (90,9) (180,7) (270,8) (360,8)};
% TinyML
\addplot+[] coordinates {(0,10) (90,10) (180,10) (270,10) (360,10)};
%\legend{Cloud ML, Edge ML, Mobile ML, TinyML}
\addplot[draw=myblue,line width=1.5pt] coordinates {(0,2) (90,3) (180,2) (270,2) (360,2)};
\addplot[draw=mygreen,line width=1.5pt] coordinates {(0,7) (90,7) (180,8) (270,6) (360,7)};
\end{polaraxis}
\end{scope}
\node[below=2mm of GR2]{\large b)};
\end{tikzpicture}
```
:::

Plot b) emphasizes operational dimensions where TinyML excels (privacy, connectivity independence, offline capability) versus Cloud ML's reliance on centralized infrastructure and constant connectivity.

Development complexity varies inversely with hardware capability: Cloud and TinyML require deep expertise (cloud infrastructure and embedded systems, respectively), while Mobile and Edge use more accessible SDKs and tooling. Cost structures follow a similar pattern: Cloud incurs ongoing operational expenses (\$1,000s+/month), Edge requires moderate upfront investment (\$100s-\$1,000s), Mobile uses existing devices (\$0-\$10s), and TinyML minimizes hardware costs (\$1-\$10s) while demanding higher development investment.
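
These temporal cost patterns can be compared with simple break-even arithmetic. The figures below are placeholders chosen within the ranges quoted above, not a costing of any real deployment.

```python
# Hypothetical break-even between recurring cloud opex and edge capex.
cloud_monthly = 2_000.0  # recurring cloud inference bill ($/month)
edge_capex = 20_000.0    # upfront edge hardware purchase ($)
edge_monthly = 200.0     # edge power + maintenance ($/month)

# Each month on edge avoids (cloud_monthly - edge_monthly) of spend.
months = edge_capex / (cloud_monthly - edge_monthly)
print(f"edge hardware pays for itself after ~{months:.0f} months")  # ~11
```

The same arithmetic run with a spiky, unpredictable workload (low average cloud bill) pushes the break-even point out by years, which is why cloud's pay-as-you-go structure wins for bursty demand.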

A critical pitfall in deployment selection is choosing paradigms based solely on model accuracy without considering system-level constraints. A cloud-deployed model achieving 99% accuracy becomes useless for autonomous emergency braking if network latency exceeds reaction time requirements; a high-accuracy edge model that drains a mobile device's battery in minutes fails despite superior accuracy. Successful deployment requires evaluating latency requirements, power budgets, network reliability, data privacy regulations, and total cost of ownership simultaneously. These constraints should be established *before* model development to avoid expensive architectural pivots late in the project.

### Decision Framework {#sec-ml-systems-decision-framework-241f}

\index{decision framework!paradigm selection} Selecting the appropriate deployment paradigm requires systematic evaluation of application constraints rather than organizational biases or technology trends. Follow the decision tree in @fig-mlsys-playbook-flowchart, which filters options through a hierarchy of critical requirements: privacy, latency, computational demands, and cost constraints.

::: {#fig-mlsys-playbook-flowchart fig-env="figure" fig-pos="t" fig-cap="**Deployment Decision Logic**: This flowchart guides selection of an appropriate machine learning deployment paradigm by systematically evaluating privacy requirements and processing constraints, ultimately balancing performance, cost, and data security. Navigating the decision tree helps practitioners determine whether cloud, edge, mobile, or tiny machine learning best suits a given application." fig-alt="Decision flowchart with four layers: Privacy, Performance, Compute Needs, and Cost. Each layer filters toward deployment options: Cloud ML, Edge ML, Mobile ML, or TinyML based on constraints."}
```{.tikz}
\resizebox{.7\textwidth}{!}{%
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n},line width=0.75pt]
\tikzset{
Line/.style={line width=1.0pt,black!50,text=black},
Box/.style={inner xsep=2pt,
draw=GreenLine, line width=0.65pt,
fill=GreenL,
text width=25mm,align=flush center,
minimum width=25mm, minimum height=9mm
},
Box1/.style={inner xsep=2pt,
node distance=0.5,
draw=BlueLine, line width=0.65pt,
fill=BlueL,
text width=33mm,align=flush center,
minimum width=33mm, minimum height=9mm
},
Text/.style={inner xsep=2pt,
draw=none, line width=0.75pt,
fill=TextColor,
font=\footnotesize\usefont{T1}{phv}{m}{n},
align=flush center,
minimum width=7mm, minimum height=5mm
},
}
%
\begin{scope}
\node[Box, rounded corners=12pt,fill=magenta!20](B1){Start};
\node[Box1,below=of B1](B2){Is privacy critical?};
\node[Box,below left=0.1 and 1 of B2](B3){Cloud Processing Allowed};
\node[Box,below right=0.1 and 1 of B2](B4){Local Processing Preferred};
\draw[Line,-latex](B1)--(B2);
\draw[Line,-latex](B2)-|node[Text,pos=0.2]{No}(B3);
\draw[Line,-latex](B2)-|node[Text,pos=0.2]{Yes}(B4);
\scoped[on background layer]
\node[draw=BackLine,inner xsep=12mm,inner ysep=3mm,yshift=-1mm,
fill=BackColor,fit=(B1)(B3)(B4),line width=0.75pt](BB){};
\node[below=11pt of BB.north east,anchor=east]{Layer: Privacy};
\end{scope}
%
\begin{scope}[shift={(0,-4.6)}]
\node[Box1](2B1){Is low latency required ($<$10 ms)?};
\node[Box,below left=0.1 and 1 of 2B1](2B2){Latency Tolerant};
\node[Box,below right=0.1 and 1 of 2B1](2B3){Tiny or Edge ML};
\draw[Line,-latex](2B1)-|node[Text,pos=0.2]{No}(2B2);
\draw[Line,-latex](2B1)-|node[Text,pos=0.2]{Yes}(2B3);
\scoped[on background layer]
\node[draw=BackLine,inner xsep=12mm,inner ysep=4mm,yshift=0mm,
fill=BackColor,fit=(2B1)(2B2)(2B3),line width=0.75pt](BB1){};
\node[below=11pt of BB1.north east,anchor=east]{Layer: Performance};
\end{scope}
\draw[Line,-latex](B3)--++(270:1.1)-|(2B1.110);
\draw[Line,-latex](B4)--++(270:1.1)-|(2B1.70);
%
\begin{scope}[shift={(0,-8.0)}]
\node[Box1](3B1){Does the model require significant compute?};
\node[Box,below left=0.1 and 1 of 3B1](3B2){Heavy Compute};
\node[Box,below right=0.1 and 1 of 3B1](3B3){Lightweight Processing};
\draw[Line,-latex](3B1)-|node[Text,pos=0.2]{Yes}(3B2);
\draw[Line,-latex](3B1)-|node[Text,pos=0.2]{No}(3B3);
\scoped[on background layer]
\node[draw=BackLine,inner xsep=12mm,inner ysep=5mm,yshift=1mm,
fill=BackColor,fit=(3B1)(3B2)(3B3),line width=0.75pt](BB2){};
\node[below=11pt of BB2.north east,anchor=east]{Layer: Compute Needs};
\end{scope}
\draw[Line,-latex](2B2)--++(270:1.1)-|(3B1.110);
\draw[Line,-latex](2B3)--++(270:1.1)-|(3B1.70);
%4
\begin{scope}[shift={(0,-11.4)}]
\node[Box1](4B1){Are there strict cost constraints?};
\node[Box,below left=0.1 and 1 of 4B1](4B2){Flexible Budget};
\node[Box,below right=0.1 and 1 of 4B1](4B3){Low-Cost Options};
\draw[Line,-latex](4B1)-|node[Text,pos=0.2]{No}(4B2);
\draw[Line,-latex](4B1)-|node[Text,pos=0.2]{Yes}(4B3);
\scoped[on background layer]
\node[draw=BackLine,inner xsep=12mm,inner ysep=4mm,yshift=2mm,
fill=BackColor,fit=(4B1)(4B2)(4B3),line width=0.75pt](BB3){};
\node[below=11pt of BB3.north east,anchor=east]{Layer: Cost};
\end{scope}
\draw[Line,-latex](3B2)--++(270:1.1)-|(4B1.110);
\draw[Line,-latex](3B3)--++(270:1.1)-|(4B1.70);
%5
\begin{scope}[shift={(-0.45,-14.0)},anchor=north east]
\node[Box,fill=magenta!20,rounded corners=12pt,text width=18mm,
minimum width=17mm](5B1){Cloud ML};
\node[Box,node distance=1.0,fill=magenta!20,rounded corners=12pt,left=of 5B1,text width=18mm,
minimum width=17mm](5B2){Edge ML};
\node[Box,node distance=1.0,fill=magenta!20, rounded corners=12pt,right=of 5B1,text width=18mm,
minimum width=17mm](5B3){Mobile ML};
\node[Box,node distance=1.0,fill=magenta!20, rounded corners=12pt,right=of 5B3,text width=18mm,
minimum width=17mm](5B4){TinyML};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=12mm,inner ysep=5mm,yshift=-1mm,
fill=BackColor,fit=(5B1)(5B2)(5B4),line width=0.75pt](BB4){};
\node[above=8pt of BB4.south east,anchor=east]{Layer: Deployment Options};
\end{scope}
\draw[Line,-latex](4B3)-|(5B3);
\draw[Line,-latex](4B3)--++(270:0.92)-|(5B4);
\draw[Line,-latex](4B2)--++(270:0.92)-|(5B1);
\draw[Line,-latex](3B2.west)--++(180:0.5)|-(5B2);
\end{tikzpicture}}
```
:::

The framework evaluates four critical decision layers sequentially. Privacy constraints form the first filter, determining whether data can be transmitted externally. Applications handling sensitive data under GDPR, HIPAA, or proprietary restrictions mandate local processing, immediately eliminating cloud-only deployments. Latency requirements establish the second constraint through response time budgets: applications requiring sub-10 ms response times cannot use cloud processing, as physics-imposed network delays alone exceed this threshold. Computational demands form the third evaluation layer, assessing whether applications require high-performance infrastructure that only cloud or edge systems provide, or whether they can operate within the resource constraints of mobile or tiny devices. Cost considerations complete the framework by balancing capital expenditure, operational expenses, and energy efficiency across expected deployment lifetimes.
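
This sequential filtering can be sketched as code. The function below mirrors the flowchart's layers; the function name, thresholds, and set operations are illustrative defaults, not normative cutoffs.

```python
def select_paradigms(privacy_critical: bool, latency_budget_ms: float,
                     needs_heavy_compute: bool, cost_constrained: bool) -> set:
    """Filter the four paradigms layer by layer, mirroring the
    decision tree (illustrative thresholds only)."""
    options = {"cloud", "edge", "mobile", "tiny"}
    if privacy_critical:
        options.discard("cloud")      # data must stay local
    if latency_budget_ms < 10:
        options.discard("cloud")      # network RTT alone blows the budget
    if needs_heavy_compute:
        options.discard("tiny")       # MCUs cannot sustain the FLOPs
    if cost_constrained:
        options -= {"cloud", "edge"}  # steer toward low-cost tiers
    return options

# No privacy bar, sub-10 ms budget, heavy compute, flexible cost:
print(select_paradigms(False, 9, True, False))  # {'edge', 'mobile'}
```

An empty result set is itself informative: it signals that no single paradigm satisfies every constraint and a hybrid architecture must be considered.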
|
||
|
||
The following worked example applies this framework step by step to a safety-critical application: *autonomous vehicle emergency braking*.

::: {.callout-notebook title="Autonomous Vehicle Emergency Braking"}

**Application**: Vision-based pedestrian detection for emergency braking.

**Walking through the decision framework**:

1. **Privacy**: Vehicle camera data is not transmitted to third parties → No strong privacy constraint. *Could use cloud.*

```{python}
#| label: braking-distance-calc
#| echo: false
# ┌─────────────────────────────────────────────────────────────────────────────
# │ BRAKING DISTANCE CALCULATION
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Emergency braking example in @sec-ml-systems-decision-framework-241f
# │
# │ Goal: Calculate the distance a car travels during a given latency.
# │ Show: That at 100 km/h, a 100 ms latency equals 2.8 meters of travel.
# │ How: Convert km/h to m/s; multiply by latency in seconds.
# │
# │ Imports: mlsysim.book (check)
# │ Exports: BrakingDistance.speed_str, BrakingDistance.distance_str,
# │          BrakingDistance.latency_str
# └─────────────────────────────────────────────────────────────────────────────
from mlsysim.fmt import check

class BrakingDistance:
    # ┌── 1. LOAD ──────────────────────────────────────────
    speed_kmh = 100
    latency_ms = 100
    # ┌── 2. EXECUTE ───────────────────────────────────────
    speed_ms = speed_kmh / 3.6  # km/h -> m/s
    distance_m = speed_ms * (latency_ms / 1000)
    # ┌── 4. OUTPUT ────────────────────────────────────────
    speed_str = f"{speed_kmh}"
    distance_str = f"{distance_m:.1f}"
    latency_str = f"{latency_ms}"
```

2. **Latency**: Emergency braking requires <`{python} BrakingDistance.latency_str` ms total response. At `{python} BrakingDistance.speed_str` km/h, a car travels `{python} BrakingDistance.distance_str` meters in `{python} BrakingDistance.latency_str` ms.
   - Network latency to cloud: 50-150 ms (variable) → **Fails requirement**
   - Edge processing: 10-30 ms → **Passes**
   - *Decision: Cloud eliminated by physics.*
3. **Compute**: Pedestrian detection requires ~10 GFLOPs per frame at 30 FPS = 300 GFLOPs/s sustained.
   - TinyML (<1 GFLOP/s): **Fails**
   - Mobile NPU (~35 TOPS): Possible but thermal constraints limit sustained operation
   - Edge GPU (~10+ TFLOPS): **Passes with margin**
   - *Decision: Edge or high-end Mobile.*
4. **Cost**: Safety-critical, high-volume production (millions of vehicles).
   - Edge GPU: \$500-1000 per vehicle, amortized over 10+ year vehicle life = \$50-100/year
   - *Decision: Edge GPU justified for safety-critical application.*

**Result**: Edge ML with local GPU (NVIDIA Drive Orin class). Cloud used only for training, model updates, and fleet-wide analytics—not real-time inference.

**Key insight**: Latency constraints eliminated 75% of options before we considered compute or cost.
:::
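
The elimination logic of this walkthrough can be sketched as a constraint filter over the four paradigms. The latency and compute figures below are illustrative round numbers (the cloud entry folds in network round-trip time), not measured specifications:

```python
# Illustrative constraint filter mirroring the worked example above.
# All numbers are rough assumptions for demonstration, not vendor specs.
PARADIGMS = {
    # name: (end-to-end latency ms, sustained compute GFLOP/s)
    "cloud":  (150.0, 1_000_000.0),  # abundant compute, but network RTT dominates
    "edge":   (20.0,  10_000.0),
    "mobile": (30.0,  1_000.0),
    "tinyml": (5.0,   1.0),
}

def feasible(latency_budget_ms: float, compute_gflops: float) -> list[str]:
    """Return the paradigms that satisfy both the latency budget and compute demand."""
    return [
        name for name, (lat, flops) in PARADIGMS.items()
        if lat <= latency_budget_ms and flops >= compute_gflops
    ]

# Emergency braking: <100 ms end to end, ~300 GFLOP/s sustained.
print(feasible(latency_budget_ms=100, compute_gflops=300))  # ['edge', 'mobile']
```

The filter reproduces the walkthrough's conclusion: latency alone removes the cloud, compute removes TinyML, and cost considerations then pick among the survivors.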

The decision framework above identifies technically feasible options, but feasibility does not guarantee success. Production deployment also depends on organizational capabilities that determine whether a technically sound choice can be implemented and maintained effectively.

Successful deployment requires considering factors beyond pure engineering constraints. Team expertise must align with paradigm requirements—Cloud ML demands distributed systems knowledge, Edge ML requires device management capabilities, Mobile ML needs platform-specific optimization skills, and TinyML requires embedded systems expertise—and organizations lacking appropriate skills face extended development timelines that can undermine even the strongest technical advantages. Monitoring and maintenance capabilities similarly determine viability at scale: edge deployments require distributed device orchestration, while TinyML demands specialized firmware management that many organizations lack. Cost structures add another dimension, because the temporal pattern of expenses varies dramatically across paradigms. Cloud incurs recurring operational costs favorable for unpredictable workloads; Edge requires substantial upfront investment offset by lower ongoing costs; Mobile uses user-provided devices to minimize infrastructure expenses; and TinyML minimizes hardware costs while demanding significant development investment.

These organizational realities surface a broader concern: a machine learning approach is not always the right choice. Every ML deployment carries a *complexity tax* that must be weighed against simpler alternatives.

::: {.callout-perspective title="The Complexity Tax"}

\index{complexity tax!ML vs heuristics} Before committing to any ML deployment, weigh the **Complexity Tax** against simpler alternatives.

Consider a classification problem solvable by either a **Heuristic** (if-then rules) or a **Deep Learning Pipeline**:

1. **The Heuristic**: 50 lines of code. Near-zero compute cost. Maintenance: ~1 hour/month to update rules. No drift.
2. **The ML System**: 50 lines of model code + 2,000 lines of infrastructure (data pipelines, monitoring, GPU drivers). Maintenance: ~40 hours/month debugging drift and managing infrastructure.

If the ML system provides 95% accuracy and the heuristic provides 90%, is that 5% gain worth a **40$\times$ increase** in complexity? ML systems engineering is the art of minimizing this tax through robust architecture. If you cannot afford the operational cost to maintain model quality over time, the simpler heuristic may be the superior systems choice.
:::
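
Putting rough numbers on this comparison makes the tax concrete. The maintenance hours and accuracies come from the callout above; the hourly engineering cost is an assumed figure for illustration:

```python
# Illustrative complexity-tax arithmetic. HOURLY_COST is an assumption,
# not a figure from the text.
HOURLY_COST = 150  # $/engineer-hour (assumed)

heuristic = {"accuracy": 0.90, "maint_hours_month": 1}
ml_system = {"accuracy": 0.95, "maint_hours_month": 40}

extra_hours = ml_system["maint_hours_month"] - heuristic["maint_hours_month"]
extra_cost = extra_hours * HOURLY_COST                       # $/month
gain_points = (ml_system["accuracy"] - heuristic["accuracy"]) * 100

cost_per_point = extra_cost / gain_points                    # $/month per accuracy point
print(f"${extra_cost:,}/month buys {gain_points:.0f} points "
      f"(${cost_per_point:,.0f}/month per accuracy point)")
```

If each accuracy point is worth less to the business than this monthly figure, the heuristic wins on systems grounds even though it loses on the leaderboard.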

This complexity tax applies to every deployment decision. Before proceeding to hybrid architectures, the following checkpoint tests whether these trade-offs are clear.

::: {.callout-checkpoint title="System Design"}
The central trade-off is often **Accuracy vs. Complexity**.

**Decision Gates**

- [ ] **The Baseline**: Have you measured the accuracy of a simple heuristic (regex, logistic regression) before training a Deep Network?
- [ ] **The Infrastructure Cost**: Is the 2% accuracy gain from a Transformer worth the 10$\times$ inference cost and maintenance burden compared to a smaller model?
:::

Successful deployment balances technical optimization against organizational capability. Paradigm selection extends well beyond technical requirements to encompass team skills, operational capacity, and economic constraints, all constrained by the physical scaling laws we have examined. Operational aspects are detailed in @sec-ml-operations and benchmarking approaches in @sec-benchmarking. In practice, however, the decision framework rarely points to a single winner. Most production systems combine multiple paradigms (training in the cloud, serving at the edge, preprocessing on mobile) to satisfy constraints that no single deployment target can meet alone.

## Hybrid Architectures {#sec-ml-systems-hybrid-architectures-combining-paradigms-7cdd}

\index{hybrid architectures!combining paradigms} \index{Hybrid ML!integration strategies}The decision framework (@fig-mlsys-playbook-flowchart) helps select the best single paradigm for a given application. In practice, however, production systems rarely use just one paradigm. Voice assistants combine TinyML wake-word detection with mobile speech recognition and cloud natural language understanding. Autonomous vehicles pair edge inference for real-time perception with cloud training for model updates. These hybrid architectures exploit the strengths of multiple paradigms while mitigating their individual weaknesses. This section formalizes the integration strategies that make such combinations effective.

::: {.callout-definition title="Hybrid ML"}

***Hybrid Machine Learning***\index{Hybrid ML!definition} is the architectural strategy of **Hierarchical Distribution** across cloud and edge resources.

1. **Significance (Quantitative):** It partitions the ML workload across the **Latency-Compute Pareto Frontier**, minimizing the **Distance Penalty** ($L_{lat}$) for reactive tasks while utilizing cloud resources ($R_{peak}$) for heavy processing.
2. **Distinction (Durable):** Unlike **Cloud-Only** or **Edge-Only** deployments, Hybrid ML is defined by **Dynamic Task Offloading** based on resource availability and network status.
3. **Common Pitfall:** A frequent misconception is that Hybrid ML is just "running two models." In reality, it is a **Unified Data Fabric** where the state must be synchronized across disparate hardware to ensure consistency.

:::

### Integration Patterns {#sec-ml-systems-integration-patterns-5935}

\index{Hybrid ML!train-serve split} \index{Hybrid ML!hierarchical processing} \index{Hybrid ML!progressive deployment}Three essential patterns address common integration challenges:

The **Train-Serve Split**\index{train-serve split!economics} places training in the cloud while inference happens on edge, mobile, or tiny devices. This pattern exploits cloud scale for training while benefiting from local inference latency and privacy. Training costs may reach millions of dollars for large models, while inference costs mere cents per query when deployed efficiently.[^fn-train-serve-cost-asymmetry]

In **Hierarchical Processing**\index{hierarchical processing!data flow}, data and intelligence flow between computational tiers. TinyML sensors perform basic anomaly detection, edge devices aggregate and analyze data from multiple sensors, and cloud systems handle complex analytics and model updates. Each tier handles tasks appropriate to its capabilities.

**Progressive Deployment**\index{progressive deployment!model compression} systematically compresses models for deployment across tiers. A large cloud model is distilled into progressively optimized versions for edge servers, mobile devices, and tiny sensors. Amazon Alexa exemplifies this pattern: wake-word detection uses <1 KB models consuming <1 mW, while complex natural language understanding requires GB+ models in cloud infrastructure.

[^fn-train-serve-cost-asymmetry]: **Train-Serve Cost Asymmetry**: Training is a one-time, compute-intensive search for model parameters, while inference is a single, cheap forward pass using those parameters. This creates the economic rationale for the split, as the massive fixed training cost is amortized over billions of subsequent low-cost inference queries. The resulting cost gap between a multi-million dollar training run and a sub-cent inference can exceed 1,000,000x. \index{Train-Serve Split!cost asymmetry}
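
The hierarchical pattern amounts to a tiered escalation policy: each tier handles what it can and forwards only what it cannot. A minimal sketch, where the tier names and anomaly thresholds are illustrative assumptions rather than values from the text:

```python
# Tiered escalation policy sketching Hierarchical Processing.
# Thresholds are illustrative assumptions.
def route_reading(anomaly_score: float) -> str:
    """Decide which tier handles a sensor reading (0.0 = routine, 1.0 = severe)."""
    if anomaly_score < 0.2:
        return "tinyml:drop"        # routine reading, discard on-sensor
    if anomaly_score < 0.8:
        return "edge:aggregate"     # worth cross-sensor analysis nearby
    return "cloud:analyze"          # rare event, escalate for heavy analytics

readings = [0.05, 0.10, 0.45, 0.92, 0.07]
print([route_reading(r) for r in readings])
```

Because most readings terminate at the cheapest tier, only a small fraction of traffic ever consumes network bandwidth or cloud compute, which is exactly where the pattern's savings come from.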

With three integration patterns available, selecting the right one for a given application requires matching the pattern's trade-off profile to the system's dominant constraints. The following *pattern selection guide* summarizes when each pattern applies.

::: {.callout-perspective title="Pattern Selection Guide"}

**Train-Serve Split** — *Trade-off: Training cost vs. inference latency*

- *Choose when*: Training requires scale that inference does not; privacy matters for inference but not training
- *Avoid when*: Model needs continuous learning from deployed data

**Hierarchical Processing** — *Trade-off: Local autonomy vs. global optimization*

- *Choose when*: Data volume exceeds transmission capacity; decisions needed at multiple timescales
- *Avoid when*: All processing can occur at one tier; network is reliable and fast

**Progressive Deployment** — *Trade-off: Model quality vs. deployment reach*

- *Choose when*: Same model needed at multiple capability levels; graceful degradation required
- *Avoid when*: Model cannot be meaningfully compressed; single deployment target

**Common combinations**: Voice assistants use Train-Serve Split + Progressive Deployment + Hierarchical Processing. Autonomous vehicles combine Hierarchical Processing with Progressive Deployment to run optimized models at each tier.

Additional patterns including federated and collaborative learning enable privacy-preserving distributed training across devices.
:::
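
Progressive Deployment can be sketched as a compression cascade from a single reference model. The compression factors and the one-byte-per-parameter (INT8) storage figure below are illustrative assumptions, not measured values:

```python
# Illustrative compression cascade for Progressive Deployment.
CLOUD_PARAMS = 1_000_000_000                     # 1B-parameter cloud reference model
TIER_FACTOR = {"edge": 10, "mobile": 100, "tinyml": 10_000}  # assumed reductions

for tier, factor in TIER_FACTOR.items():
    params = CLOUD_PARAMS // factor
    size_kb = params / 1024                      # ~1 byte per weight at INT8
    print(f"{tier:>6}: {params:>13,} params ≈ {size_kb:>10,.0f} KB")
```

The cascade shows why the tiny tier cannot be reached by precision reduction alone: getting from gigabytes to kilobytes requires shrinking the parameter count itself, not just the bytes per parameter.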

### Production System Integration {#sec-ml-systems-production-system-integration-3bb3}

\index{Hybrid ML!production systems} \index{data pipelines!hybrid architectures}Real-world implementations integrate multiple design patterns into cohesive solutions. @fig-hybrid makes these interactions concrete through specific connection types. Notice the bidirectional flow: "Deploy" paths show how models flow *downward* from cloud training to various devices, while "Data" and "Results" flow *upward* from sensors through processing stages to cloud analytics. "Sync" connections demonstrate device coordination across tiers. This bidirectional architecture—models flowing down, data flowing up—is the defining characteristic of production hybrid systems.

::: {#fig-hybrid fig-env="figure" fig-pos="t" fig-cap="**Hybrid System Interactions**: Data flows upward from sensors through processing layers to cloud analytics, while trained models deploy downward to edge, mobile, and TinyML inference points. Five connection types (deploy, data, results, assist, and sync) establish a distributed architecture where each paradigm contributes unique capabilities." fig-alt="System diagram with four ML paradigms: TinyML sensors, Edge inference, Mobile processing, and Cloud training. Arrows show deploy, data, results, sync, and assist flows between tiers."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
  Line/.style={line width=1.0pt,black!50,text=black},
  Box/.style={inner xsep=2pt,
    node distance=0.6,
    draw=GreenLine, line width=0.75pt,
    fill=GreenL,
    text width=20mm,align=flush center,
    minimum width=20mm, minimum height=9mm
  },
  Text/.style={inner xsep=2pt,
    draw=none, line width=0.75pt,
    fill=TextColor,
    font=\footnotesize\usefont{T1}{phv}{m}{n},
    align=flush center,
    minimum width=7mm, minimum height=5mm
  },
}

\node[Box,fill=RedL,draw=RedLine](G2){Training};
\node[Box,fill=none,draw=none,below =1.2 of G2](A){};
\node[Box,node distance=2.25, left=of A](B2){Inference};
\node[Box,node distance=2.25,left=of B2,fill=BlueFill,draw=BlueLine](B1){Inference};
\node[Box,node distance=2.25, right=of A,fill=OrangeFill,draw=OrangeLine](B3){Inference};
%
\node[Box,node distance=1.15, below=of B1,fill=BlueFill,draw=BlueLine](1DB1){Processing};
\node[Box,node distance=1.15, below=of B3,fill=OrangeFill,draw=OrangeLine](1DB3){Processing};
\path[](1DB3)-|coordinate(S)(G2);
\node[Box,node distance=1.5,fill=RedL,draw=RedLine]at(S)(1DB2){Analytics};
\path[](G2)-|coordinate(SS)(B2);
\node[Box](G1)at(SS){Sensors};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=4mm,inner ysep=6mm,anchor= west,
  yshift=1mm,fill=BackColor,fit=(G1)(B2),line width=0.75pt](BB2){};
\node[below=3pt of BB2.north,anchor=north]{TinyML};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=4mm,inner ysep=7mm,anchor= west,
  yshift=0mm,fill=BackColor,fit=(G2)(1DB2),line width=0.75pt](BB2){};
\node[below=3pt of BB2.north,anchor=north]{Cloud ML};
%
\draw[Line,-latex](G1.west)--++(180:0.9)|-node[Text,pos=0.1]{Data}(B2);
\draw[Line,-latex](G2)--++(270:1.20)-|(B2);
\draw[Line,-latex](G2)--++(270:1.20)-|(B3);
\draw[Line,-latex](G2)--node[Text,pos=0.46]{Deploy}++(270:1.20)-|(B1);
%
\draw[Line,-latex](B1)--node[Text,pos=0.5]{Results}(1DB1);
\draw[Line,-latex](B2)|-node[Text,pos=0.75]{Results}(1DB1.10);
%
\draw[Line,-latex](B1.330)--++(270:0.9)-|node[Text,pos=0.2]{Assist}(B3.220);
\draw[Line,-latex](B2.east)--node[Text,pos=0.5]{Sync}++(0:5.4)|-(1DB3.170);
%
\draw[Line,-latex](1DB1.350)--node[Text,pos=0.75]{Results}(1DB2.190);
\draw[Line,-latex](1DB3.190)--node[Text,pos=0.50]{Data}(1DB2.350);
\draw[Line,-latex](B3.290)--node[Text,pos=0.5]{Results}(1DB3.70);
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=4mm,inner ysep=5mm,anchor= west,
  yshift=-2mm,fill=BackColor,fit=(B1)(1DB1),line width=0.75pt](BB2){};
\node[above=3pt of BB2.south,anchor=south]{Edge ML};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=4mm,inner ysep=5mm,anchor= west,
  yshift=-2mm,fill=BackColor,fit=(B3)(1DB3),line width=0.75pt](BB2){};
\node[above=3pt of BB2.south,anchor=south]{Mobile ML};
\end{tikzpicture}
```
:::

Production systems demonstrate these integration patterns across diverse applications. Industrial defect detection exemplifies Train-Serve Split: cloud infrastructure trains vision models on datasets from multiple facilities, then distributes optimized versions to edge servers managing factory floors, tablets for quality inspectors, and embedded cameras on production lines. Agricultural monitoring illustrates Hierarchical Processing: soil sensors perform local anomaly detection at the TinyML tier, edge processors aggregate data from dozens of sensors and identify field-level patterns, while cloud infrastructure handles farm-wide analytics and seasonal planning. Fitness tracking exemplifies Progressive Deployment with gateway patterns: wearables continuously monitor activity using microcontroller-optimized algorithms consuming <1 mW, sync processed summaries to smartphones that combine metrics from multiple sources, then transmit periodic updates to cloud infrastructure for longitudinal health analysis.

### Why Hybrid Approaches Work {#sec-ml-systems-hybrid-approaches-work-4bb8}

\index{Hybrid ML!convergence principles} The success of hybrid architectures stems from a deeper truth: despite their diversity, all ML deployment paradigms share core principles. @fig-ml-systems-convergence illustrates this convergence: implementations spanning cloud to tiny devices meet at the same core system challenges—managing data pipelines, balancing resource constraints, and implementing reliable architectures.

::: {#fig-ml-systems-convergence fig-env="figure" fig-pos="t" fig-cap="**Convergence of ML Systems**: Three-layer structure showing how diverse deployments converge. The top layer lists four paradigms (Cloud, Edge, Mobile, TinyML); the middle layer identifies shared foundations (data pipelines, resource management, architecture principles); and the bottom layer presents cross-cutting concerns (optimization, operations, trustworthy AI) that apply across all paradigms." fig-alt="Three-layer diagram. Top: Cloud, Edge, Mobile, TinyML implementations. Middle: data pipeline, resource management, architecture principles. Bottom: optimization, operations, trustworthy AI. Arrows connect layers."}
```{.tikz}
\begin{tikzpicture}[font=\small\usefont{T1}{phv}{m}{n}]
\tikzset{
  Line/.style={line width=1.0pt,black!50,text=black},
  Box/.style={inner xsep=2pt,
    node distance=0.6,
    draw=GreenLine, line width=0.75pt,
    fill=GreenL,
    text width=30mm,align=flush center,
    minimum width=30mm, minimum height=13mm
  },
  Box1/.style={inner xsep=2pt,
    node distance=0.8,
    draw=BlueLine, line width=0.75pt,
    fill=BlueL,
    text width=36mm,align=flush center,
    minimum width=40mm, minimum height=13mm
  },
}

\begin{scope}[anchor=west]
\node[Box](B1){Cloud ML Data Centers Training at Scale};
\node[Box,right=of B1](B2){Edge ML Local Processing Inference Focus};
\node[Box,right=of B2](B3){Mobile ML Personal Devices User Applications};
\node[Box, right=of B3](B4){TinyML Embedded Systems Resource Constrained};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=5mm,minimum width=170mm,
  anchor=west,yshift=2mm,fill=BackColor,
  fit=(B1)(B2)(B3)(B4),line width=0.75pt](BB){};
\node[below=11pt of BB.north east,anchor=east]{ML System Implementations};
\end{scope}
%
\begin{scope}[shift={(0.4,-2.8)}, anchor=west]
\node[Box1](2B1){Data Pipeline Collection -- Processing -- Deployment};
\node[Box1,right=of 2B1](2B2){Resource Management Compute -- Memory -- Energy -- Network};
\node[Box1,right=of 2B2](2B3){System Architecture Models -- Hardware -- Software};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=5mm,minimum width=170mm,
  anchor= west,yshift=-1mm,fill=BackColor,fit=(2B1)(2B2)(2B3),line width=0.75pt](BB2){};
\node[above=8pt of BB2.south east,anchor=east]{Core System Principles};
\end{scope}
%
\begin{scope}[shift={(0.4,-6.0)}, anchor=west]
\node[Box1, fill=VioletL,draw=VioletLine](3B1){Optimization \& Efficiency Model -- Hardware -- Energy};
\node[Box1,right=of 3B1, fill=VioletL,draw=VioletLine](3B2){Operational Aspects Deployment -- Monitoring -- Updates};
\node[Box1,right=of 3B2, fill=VioletL,draw=VioletLine](3B3){Trustworthy AI Security -- Privacy -- Reliability};
%
\scoped[on background layer]
\node[draw=BackLine,inner xsep=5mm,inner ysep=5mm,minimum width=170mm,
  anchor= west,yshift=-1mm,fill=BackColor,fit=(3B1)(3B2)(3B3),line width=0.75pt](BB3){};
\node[above=8pt of BB3.south east,anchor=east]{System Considerations};
\end{scope}
%
\draw[-latex,Line](B1.south)--++(270:0.75)-|(2B1);
\draw[-latex,Line](B2.south)--++(270:0.75)-|(2B1);
\draw[-latex,Line](B3.south)--++(270:0.75)-|(2B1);
\draw[-latex,Line](B4.south)--++(270:0.75)-|(2B1);
\draw[-latex,Line](B2.south)--++(270:0.75)-|(2B2);
\draw[-latex,Line](B3.south)--++(270:0.75)-|(2B3);
%
\draw[-latex,Line](2B1.south)--++(270:0.95)-|(3B1);
\draw[-latex,Line](2B2.south)--++(270:0.95)-|(3B1);
\draw[-latex,Line](2B3.south)--++(270:0.95)-|(3B1);
\draw[-latex,Line](2B2.south)--++(270:0.95)-|(3B2);
\draw[-latex,Line](2B3.south)--++(270:0.95)-|(3B3);
\end{tikzpicture}
```
:::

This convergence explains why techniques transfer effectively between scales. Cloud-trained models deploy to edge because both training and inference minimize the same loss function—only the compute budget differs. Quantization techniques developed for edge deployment reduce cloud serving costs, and distributed training strategies inform edge model parallelism.

Mobile optimization insights inform cloud efficiency because memory bandwidth constraints appear at every scale. Techniques like operator fusion and activation checkpointing, developed for mobile's tight memory budgets, reduce cloud inference costs by 2-3$\times$ when applied to batch serving. TinyML innovations drive cross-paradigm advances because extreme constraints force genuinely novel algorithmic breakthroughs: binary neural networks, developed for microcontrollers, now accelerate cloud recommendation systems, and sparse attention mechanisms, essential for fitting transformers in kilobytes, reduce cloud training costs.

The remaining chapters explore each layer: @sec-data-engineering for data pipelines, @sec-model-compression for optimization, and @sec-ml-operations for operational aspects. All of these apply whether the target is a TPU Pod or an ESP32. However, shared principles also mean shared vulnerabilities: the same operational challenges—data drift, model decay, monitoring—appear at every tier and demand attention before we consider the chapter's remaining lessons.

:::: {.callout-checkpoint title="Hybrid ML Patterns"}
Hybrid architectures work when you partition *work* across tiers—not when you copy the same pipeline everywhere.

**Integration Patterns**

- [ ] **Train-Serve Split**: Can you explain why training in the cloud and serving on edge/mobile is often economically optimal, even when the model runs locally?
- [ ] **Hierarchical Processing**: Can you describe what each tier does in a sensor → edge → cloud pipeline, and why pushing *some* decisions down reduces both latency and bandwidth?
- [ ] **Progressive Deployment**: Can you explain how one model family becomes multiple deployed artifacts (cloud, edge, mobile, tiny) through systematic compression?

**Design Sanity Checks**

- [ ] **Boundary choice**: Given a concrete application, can you justify *where* the tier boundary should fall (latency, privacy, bandwidth, power), not just *what* model to use?
- [ ] **Data fabric**: Can you name the minimal data flows that must go *up* (telemetry, labels, drift signals) to keep the deployed system from decaying?
::::

The shared foundations in @fig-ml-systems-convergence also share a vulnerability. Deployment is not the end of the engineering challenge—it is the beginning of a new one. Traditional software, once deployed correctly, remains correct indefinitely: a sorting algorithm that works today will work tomorrow, next year, and a decade from now. ML systems face a fundamentally different reality: **System Entropy (statistical decay)**\index{system entropy!model decay}.

\index{Degradation Equation!distribution shift}
Unlike a sorting algorithm that remains correct as long as the code is unchanged, an ML model's accuracy degrades as the world drifts away from its training distribution. The **Degradation Equation** from @sec-introduction captures this formally: system quality decays as the distance between the training distribution and the live data distribution grows, at a rate proportional to the model's sensitivity to distributional shift. Every deployed model is in a state of unobserved decay from the moment it ships. Reliability in ML systems is therefore not a property of the code but a property of the monitoring and retraining infrastructure built to detect and correct this drift. The operational aspects covered in @sec-ml-operations address precisely this challenge.
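
This drift can be watched for directly by comparing the training-time distribution of a feature against its live distribution. A minimal sketch using the Population Stability Index over binned histograms; the bin values and the alert threshold are illustrative assumptions, not figures from the text:

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between two binned distributions.

    Inputs are histogram proportions over the same bins; a small floor
    avoids log(0). A commonly cited rule of thumb treats PSI above ~0.2
    as a meaningful shift worth investigating (illustrative threshold).
    """
    eps = 1e-6
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

train_bins = [0.25, 0.25, 0.25, 0.25]   # feature distribution at training time
live_bins  = [0.10, 0.20, 0.30, 0.40]   # same feature observed in production
print(f"PSI = {psi(train_bins, live_bins):.3f}")
```

Computed per feature on a schedule, a score like this turns "unobserved decay" into an observable signal that can trigger retraining before business metrics degrade.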

::: {.callout-war-story title="The Zillow Offers Collapse (2021)"}
**The Context**: Zillow, a real-estate marketplace, launched "Zillow Offers" to buy homes directly using an algorithmic valuation model ("Zestimate").

**The Failure**: The model was trained on historical data during a stable market. When the market became volatile (rapid price shifts during COVID-19), the model failed to adapt to the distribution shift. It overpaid for thousands of homes that it could not resell at a profit.

**The Consequence**: Zillow wrote down \$304 million in inventory, laid off 25% of its workforce (2,000 people), and shut down the Offers division entirely.

**The Systems Lesson**: Distribution shift is not just a metric drop; it is a business risk. Automated decision-making systems interacting with dynamic markets require rapid feedback loops and circuit breakers, not just accurate offline models.
:::

Zillow's collapse is not merely a cautionary tale. It is evidence for why ML systems engineering must exist as a principled discipline. The failure was not one of model accuracy but of *systems reasoning*: the inability to trace how distributional shift propagates from market data through a valuation model into irreversible financial commitments. A discipline built on the Statistical Drift Invariant and the Degradation Equation makes such propagation paths visible and such failure modes quantifiable *before* they compound into \$304 million losses.

Beyond statistical decay, engineers also fall prey to common misconceptions about ML deployment. The physical constraints we have examined throughout this chapter create counterintuitive behaviors that challenge intuitions from traditional software engineering. The following fallacies and pitfalls distill these hard-won lessons into actionable guidance.

## Fallacies and Pitfalls {#sec-ml-systems-fallacies-pitfalls-3dfe}

The following fallacies and pitfalls capture architectural mistakes that waste development resources, miss performance targets, or deploy systems critically mismatched to their operating constraints. Each represents a pattern we have seen repeatedly in production ML systems.

**Fallacy:** *One deployment paradigm solves all ML problems.*

Physical constraints create hard boundaries that no single paradigm can span. As @sec-ml-systems-system-balance-hardware-96ab establishes, memory bandwidth scales as the square root of chip area (constrained by die perimeter and pin count) while compute scales linearly with die area, producing qualitatively different bottlenecks across paradigms. @tbl-big_vs_tiny quantifies this: cloud ML achieves 100--1000 ms latency while TinyML delivers 1--10 ms, a 100$\times$ difference rooted in speed-of-light limits, not implementation quality. A real-time robotics system requiring sub-10 ms response cannot use cloud inference regardless of optimization, and a billion-parameter language model cannot fit on a microcontroller with 256 KB RAM regardless of quantization. The optimal architecture typically combines paradigms, such as cloud training with edge inference or mobile preprocessing with cloud analysis.

A related misconception holds that moving computation closer to the user always reduces latency, ignoring the processing overhead introduced by less powerful edge hardware—a trade-off explored in **Inference Benchmarks** (@sec-benchmarking-inference-benchmarks-2c1f).
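
This trade-off can be made concrete with a two-term latency model: total response time is network transport plus local compute, and moving inference closer only wins when the compute penalty is smaller than the network saving. The numbers below are illustrative assumptions:

```python
# Two-term latency model: transport + compute. All numbers illustrative.
def end_to_end_ms(network_rtt_ms: float, inference_ms: float) -> float:
    """Total response time under a simple additive model."""
    return network_rtt_ms + inference_ms

cloud = end_to_end_ms(network_rtt_ms=80, inference_ms=5)      # fast GPU, far away
strong_edge = end_to_end_ms(network_rtt_ms=0, inference_ms=30)
weak_edge = end_to_end_ms(network_rtt_ms=0, inference_ms=120)

assert strong_edge < cloud   # 30 ms beats 85 ms: proximity wins here
assert weak_edge > cloud     # 120 ms loses to 85 ms: proximity alone is not enough
```

The crossover point depends entirely on how much slower the edge hardware is, which is why the claim "closer is faster" must be checked against measured inference times, not assumed.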

```{python}
#| label: mobile-power-fallacy-calc
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ MOBILE POWER FALLACY: BATTERY DEPLETION CALCULATIONS
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Model optimization overcomes mobile device power limits"
# │
# │ Goal: Demonstrate the physical limits of battery-powered inference.
# │ Show: That a 5W workload depletes a standard phone battery in 3 hours.
# │ How: Calculate runtime from power draw and standard Wh capacity.
# │
# │ Imports: mlsysim.book (fmt, md_frac)
# │ Exports: MobilePowerFallacyCalc.low_power_hours_str,
# │          MobilePowerFallacyCalc.high_power_hours_str,
# │          MobilePowerFallacyCalc.low_power_frac,
# │          MobilePowerFallacyCalc.high_power_frac
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.fmt import fmt_percent, fmt, check, md_frac

# ┌── LEGO ───────────────────────────────────────────────
class MobilePowerFallacyCalc:
    """Namespace for Mobile Power Fallacy Calc."""

    battery_wh_value = 15  # Wh, typical smartphone
    low_power_w_value = 1  # W, light inference
    high_power_w_value = 5  # W, heavy on-device model

    low_power_hours_value = battery_wh_value / low_power_w_value  # 15 / 1 = 15 hours
    high_power_hours_value = battery_wh_value / high_power_w_value  # 15 / 5 = 3 hours

    low_power_hours_str = fmt(low_power_hours_value, precision=0, commas=False)  # "15"
    high_power_hours_str = fmt(high_power_hours_value, precision=0, commas=False)  # "3"

    # --- Inline fractions showing the physics ---
    low_power_frac = md_frac(f"{battery_wh_value} Wh", f"{low_power_w_value} W", f"**{low_power_hours_str} hours**")
    high_power_frac = md_frac(f"{battery_wh_value} Wh", f"{high_power_w_value} W", f"**{high_power_hours_str} hours**")
```

**Fallacy:** *Model optimization overcomes mobile device power and thermal limits.*

Compression techniques do not scale indefinitely against physics. Consider a smartphone with a `{python} MobileBatteryCapacity.phone_battery_str` Wh battery:

- **Light workload** (1 W inference): `{python} MobilePowerFallacyCalc.low_power_frac`
- **Heavy workload** (5 W, common for large on-device models): `{python} MobilePowerFallacyCalc.high_power_frac`

The 5 W workload also triggers thermal throttling that reduces performance by 40–60 percent. As @sec-ml-systems-mobile-ml-benefits-resource-constraints-c568 establishes, sustained mobile inference cannot exceed 2–3 W without active cooling. Reducing numerical precision (using fewer bits to represent each weight; see @sec-model-compression) cuts power by approximately 4$\times$, but aggressive precision reduction often causes 5–10 percent accuracy loss. Applications requiring continuous inference beyond mobile thermal envelopes remain physically impossible regardless of algorithmic improvements.

**Fallacy:** *TinyML represents scaled-down mobile ML.*

The difference is qualitative, not just quantitative. As @sec-ml-systems-tinyml-advantages-operational-tradeoffs-2d40 establishes, TinyML microcontrollers provide 256 KB to 1 MB of memory versus mobile devices with 4–12 GB, a 10,000$\times$ difference requiring entirely different algorithms. Mobile ML uses reduced-precision arithmetic with minimal accuracy loss; TinyML requires extreme precision reduction that sacrifices 10–15 percent accuracy for 32$\times$ memory reduction. Mobile devices run models with millions of parameters; TinyML models contain 10,000–100,000 parameters, demanding distinct architectural choices such as specialized lightweight operations designed to minimize multiply-accumulate counts. Power budgets show similar discontinuities: mobile inference consumes 1–5 W, while TinyML targets 1–10 mW for battery-free energy harvesting. These thousand-fold gaps make TinyML a distinct problem class, not a smaller version of mobile ML. Teams that apply mobile optimization techniques directly to TinyML projects discover that quantization from FP32 to INT8 (reducing each weight from 32 bits to 8 bits; see @sec-model-compression) is insufficient when models must fit in 64 KB, forcing complete architectural redesign.
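
The memory arithmetic behind this discontinuity is simple: weight storage is parameters × bits ÷ 8. A minimal sketch, where the memory budget and model sizes are illustrative round numbers:

```python
# Weight-storage arithmetic. Budget and model sizes are illustrative.
def weights_kb(params: int, bits: int) -> float:
    """Weight storage in KB for a model at the given numeric precision."""
    return params * bits / 8 / 1024

MCU_BUDGET_KB = 256                                 # typical TinyML SRAM budget
assert weights_kb(1_000_000, 32) > MCU_BUDGET_KB    # FP32, 1M params: ~3.9 MB, no
assert weights_kb(1_000_000, 8) > MCU_BUDGET_KB     # INT8 alone: ~977 KB, still no
assert weights_kb(100_000, 8) < MCU_BUDGET_KB       # 100K params at INT8: ~98 KB fits
```

Quantization buys at most the ratio of bit widths (4× from FP32 to INT8); closing the remaining gap requires shrinking the parameter count itself, which is why TinyML demands architectural redesign rather than compression of mobile models.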

```{python}
#| label: tco-pitfall-calc
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ TCO PITFALL: EDGE VS CLOUD TOTAL COST OF OWNERSHIP
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Pitfall "Minimizing computational resources minimizes total cost"
# │
# │ Goal: Demonstrate why minimizing compute doesn't always minimize TCO.
# │ Show: That edge deployments can have 3× higher total cost due to OpEx.
# │ How: Model CapEx and OpEx for a 100-unit edge fleet.
# │
# │ Imports: mlsysim.fmt (fmt)
# │ Exports: TcoPitfallCalc.cloud_compute_str, TcoPitfallCalc.edge_hw_str,
# │          TcoPitfallCalc.edge_network_str, TcoPitfallCalc.edge_maint_str,
# │          TcoPitfallCalc.edge_reliability_str, TcoPitfallCalc.edge_total_str,
# │          TcoPitfallCalc.tco_ratio_str
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class TcoPitfallCalc:
    """Namespace for Tco Pitfall Calc."""

    # Cloud costs (monthly)
    cloud_compute_value = 2000  # $, inference compute

    # Edge costs (monthly)
    edge_hardware_value = 500      # $, amortized hardware
    edge_network_value = 3000      # $, network engineering
    edge_maintenance_value = 500   # $, hardware maintenance
    edge_reliability_value = 2000  # $, reliability engineering

    edge_total_value = (edge_hardware_value + edge_network_value +
                        edge_maintenance_value + edge_reliability_value)  # $6,000

    tco_ratio_value = edge_total_value / cloud_compute_value  # 3x

    cloud_compute_str = fmt(cloud_compute_value, precision=0, commas=True)        # "2,000"
    edge_hw_str = fmt(edge_hardware_value, precision=0, commas=False)             # "500"
    edge_network_str = fmt(edge_network_value, precision=0, commas=True)          # "3,000"
    edge_maint_str = fmt(edge_maintenance_value, precision=0, commas=False)       # "500"
    edge_reliability_str = fmt(edge_reliability_value, precision=0, commas=True)  # "2,000"
    edge_total_str = fmt(edge_total_value, precision=0, commas=True)              # "6,000"
    tco_ratio_str = fmt(tco_ratio_value, precision=0, commas=False)               # "3"
```

**Pitfall:** *Minimizing computational resources minimizes total cost.*

Teams optimize per-unit resource consumption while ignoring operational overhead and development velocity. As the decision framework in @sec-ml-systems-decision-framework-241f emphasizes, paradigm selection requires evaluating total cost of ownership, not just compute costs. A cloud inference service costing $`{python} TcoPitfallCalc.cloud_compute_str` monthly in compute appears expensive versus $`{python} TcoPitfallCalc.edge_hw_str` monthly edge hardware amortization, but edge deployments add network engineering ($`{python} TcoPitfallCalc.edge_network_str` monthly), hardware maintenance ($`{python} TcoPitfallCalc.edge_maint_str` monthly), and reliability engineering ($`{python} TcoPitfallCalc.edge_reliability_str` monthly), totaling $`{python} TcoPitfallCalc.edge_total_str`---a `{python} TcoPitfallCalc.tco_ratio_str`$\times$ difference. Development velocity compounds the gap: cloud deployments reaching production in 2 months versus 6 months for custom edge infrastructure represent 4 months of delayed revenue. The optimal cost solution requires total cost of ownership analysis including development time, operational complexity, and opportunity costs, not merely minimizing compute expenses.
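
The TCO arithmetic above can be sketched directly. The monthly cost figures come from the text; `MONTHLY_REVENUE` is an assumed illustrative value used only to put a number on the delayed-revenue point:

```python
# Edge-vs-cloud total cost of ownership (monthly figures from the text).
cloud_monthly = 2000                     # $, cloud inference compute
edge_monthly = 500 + 3000 + 500 + 2000   # $, hardware + network + maint + reliability

print(edge_monthly)                  # 6000
print(edge_monthly / cloud_monthly)  # 3.0x monthly cost ratio

# Opportunity cost of slower time-to-market (2 vs 6 months to production).
# ASSUMPTION: illustrative revenue once the service is live.
MONTHLY_REVENUE = 50_000
print((6 - 2) * MONTHLY_REVENUE)     # 200000 of delayed revenue
```

Under these assumptions the four-month delay dwarfs the monthly compute difference, which is the paragraph's point about opportunity cost.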

```{python}
#| label: amdahl-camera-calc
#| echo: false

# ┌─────────────────────────────────────────────────────────────────────────────
# │ AMDAHL'S LAW: CAMERA PIPELINE EXAMPLE
# ├─────────────────────────────────────────────────────────────────────────────
# │ Context: Fallacy "Model optimization translates linearly to system speedup"
# │
# │ Goal: Demonstrate Amdahl's Law in a smartphone camera pipeline.
# │ Show: That a 10× model speedup yields only a 1.37× end-to-end improvement.
# │ How: Calculate total latency before and after local classifier optimization.
# │
# │ Imports: mlsysim.fmt (fmt, fmt_percent)
# │ Exports: AmdahlCameraCalc.cam_isp_str, AmdahlCameraCalc.cam_ml_str,
# │          AmdahlCameraCalc.cam_post_str, AmdahlCameraCalc.cam_total_str,
# │          AmdahlCameraCalc.cam_ml_pct_str, AmdahlCameraCalc.cam_non_ml_pct_str,
# │          AmdahlCameraCalc.cam_speedup_10x_str, AmdahlCameraCalc.cam_speedup_inf_str,
# │          AmdahlCameraCalc.cam_ml_opt_str, AmdahlCameraCalc.cam_total_opt_str
# └─────────────────────────────────────────────────────────────────────────────

from mlsysim.fmt import fmt_percent, fmt, check

# ┌── LEGO ───────────────────────────────────────────────
class AmdahlCameraCalc:
    """Namespace for Amdahl Camera Calc."""

    # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
    cam_isp_ms_value = 100     # ms, ISP + auto-exposure
    cam_ml_ms_value = 60       # ms, ML scene classification
    cam_post_ms_value = 40     # ms, tone mapping + HDR merge
    cam_ml_speedup_value = 10  # 10× faster ML model

    # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
    cam_total_ms_value = cam_isp_ms_value + cam_ml_ms_value + cam_post_ms_value  # 200 ms
    cam_ml_frac_value = cam_ml_ms_value / cam_total_ms_value  # 0.30
    cam_non_ml_frac_value = 1 - cam_ml_frac_value             # 0.70

    cam_speedup_10x_value = 1 / (cam_non_ml_frac_value + cam_ml_frac_value / cam_ml_speedup_value)
    cam_speedup_inf_value = 1 / cam_non_ml_frac_value  # theoretical max
    cam_ml_optimized_ms_value = cam_ml_ms_value / cam_ml_speedup_value  # 6 ms
    cam_total_optimized_ms_value = (cam_isp_ms_value +
                                    cam_ml_optimized_ms_value +
                                    cam_post_ms_value)  # 146 ms

    # ┌── 4. OUTPUT (Formatting) ─────────────────────────────────────────────
    cam_isp_str = fmt(cam_isp_ms_value, precision=0, commas=False)                   # "100"
    cam_ml_str = fmt(cam_ml_ms_value, precision=0, commas=False)                     # "60"
    cam_post_str = fmt(cam_post_ms_value, precision=0, commas=False)                 # "40"
    cam_total_str = fmt(cam_total_ms_value, precision=0, commas=False)               # "200"
    cam_ml_pct_str = fmt_percent(cam_ml_frac_value, precision=0, commas=False)       # "30"
    cam_non_ml_pct_str = fmt_percent(cam_non_ml_frac_value, precision=0, commas=False)  # "70"
    cam_speedup_10x_str = fmt(cam_speedup_10x_value, precision=2, commas=False)      # "1.37"
    cam_speedup_inf_str = fmt(cam_speedup_inf_value, precision=2, commas=False)      # "1.43"
    cam_ml_opt_str = fmt(cam_ml_optimized_ms_value, precision=0, commas=False)       # "6"
    cam_total_opt_str = fmt(cam_total_optimized_ms_value, precision=0, commas=False) # "146"
```

**Fallacy:** *Model optimization translates linearly to system speedup.*

Amdahl's Law\index{Amdahl's Law!speedup limits}\index{optimization!Amdahl's Law}[^fn-amdahls-law-pipeline] establishes hard limits that the Bottleneck Principle (@sec-ml-systems-bottleneck-principle-3514) formalizes: $Speedup_{overall} = \frac{1}{(1-p) + \frac{p}{s}}$ where $p$ is the fraction of work that can be improved and $s$ is the speedup of that fraction. Consider tapping the shutter on a smartphone camera. The image passes through `{python} AmdahlCameraCalc.cam_isp_str` ms of signal processing (auto-exposure, white balance), `{python} AmdahlCameraCalc.cam_ml_str` ms of ML scene classification, and `{python} AmdahlCameraCalc.cam_post_str` ms of post-processing (tone mapping, HDR merge)---`{python} AmdahlCameraCalc.cam_total_str` ms total. Optimizing the ML classifier to run 10$\times$ faster cuts that stage from `{python} AmdahlCameraCalc.cam_ml_str` ms to `{python} AmdahlCameraCalc.cam_ml_opt_str` ms, but total time drops only from `{python} AmdahlCameraCalc.cam_total_str` ms to `{python} AmdahlCameraCalc.cam_total_opt_str` ms---`{python} AmdahlCameraCalc.cam_speedup_10x_str`$\times$ overall, not 10$\times$. Even eliminating ML entirely ($s = \infty$) achieves only `{python} AmdahlCameraCalc.cam_speedup_inf_str`$\times$ speedup, because the remaining `{python} AmdahlCameraCalc.cam_non_ml_pct_str` percent of the pipeline is untouched. Effective optimization requires profiling the entire pipeline and addressing bottlenecks systematically, because system performance depends on the slowest unoptimized stage.

[^fn-amdahls-law-pipeline]: **Amdahl's Law**: Formalized by Gene Amdahl in 1967 for multiprocessor scaling, this principle applies directly to ML deployment pipelines where the model is only one stage among many. The camera example illustrates the general pattern: ML inference rarely exceeds 30--50% of total pipeline time in production systems, meaning even a 100$\times$ model speedup yields at most a 2--3$\times$ end-to-end improvement. Teams that benchmark model latency in isolation systematically overestimate deployment gains. \index{Amdahl's Law!pipeline bottleneck}
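
As a quick sanity check, the formula can be evaluated directly. This is a minimal sketch; the `amdahl_speedup` helper is ours, not part of the book's `mlsysim` package:

```python
# Amdahl's Law: overall speedup when a fraction p of the work is
# accelerated by a factor s (s = inf gives the theoretical ceiling).
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# Camera pipeline: ML is 60 ms of a 200 ms pipeline, so p = 0.30.
print(round(amdahl_speedup(0.30, 10), 2))            # 1.37 with a 10x model
print(round(amdahl_speedup(0.30, float("inf")), 2))  # 1.43 even if ML vanishes
```

The ceiling of 1.43$\times$ is fixed by the untouched 70 percent of the pipeline, no matter how fast the model gets.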

**Pitfall:** *Assuming more training data always improves deployed model performance.*

\index{scaling laws!data limitations}Three constraints limit data scaling benefits, as the workload archetypes in @sec-ml-systems-analyzing-workloads-cbb8 illustrate. First, model size limits what can be learned: a keyword spotting model with 250K parameters achieves 95% accuracy on 50K samples but only 96.5% on 1M samples, a 1.5-point gain for 20$\times$ more data, storage, and labeling cost. The model simply cannot represent more complex patterns. Second, data quality dominates quantity: 1M curated samples often outperform 100M noisy web-scraped samples, because mislabeled examples and misleading patterns degrade performance even as dataset size grows. Third, deployment distribution matters more than training scale: a model trained on 1B web images may perform worse on medical imaging than one trained on 100K domain-specific samples. Teams that maximize dataset scale without analyzing model capacity waste months of labeling effort for negligible accuracy gains.

**Pitfall:** *Deploying the same model binary across all edge devices without hardware-specific optimization.*

Teams build a single model artifact and deploy it identically to every target device, treating deployment as a packaging step rather than an optimization opportunity. In practice, hardware-specific optimizations yield 3--5$\times$ efficiency gains that generic binaries cannot capture. An INT8 model running on a device with a dedicated Neural Processing Unit (NPU) achieves 3--4$\times$ higher throughput per watt than the same model running in FP32 on a general-purpose CPU, because the NPU's fixed-function INT8 datapaths avoid the energy overhead of floating-point arithmetic. Similarly, operator fusion and memory layout tuning for a specific accelerator's cache hierarchy can halve inference latency without changing the model's weights. As the deployment paradigm analysis in @sec-ml-systems-deployment-paradigm-framework-0d25 establishes, each paradigm imposes distinct hardware constraints; a model binary optimized for an Arm Cortex-A78 will underutilize the matrix acceleration units on a device equipped with an Arm Ethos-U NPU. Teams that skip per-target optimization either waste battery life on mobile devices or fail to meet latency SLAs on edge hardware, forcing costly post-deployment remediation.

## Summary {#sec-ml-systems-summary-d75c}

This chapter answered a deceptively simple question: *why does the same model demand fundamentally different engineering on a phone versus a datacenter?* *The answer is physics.* Three immutable constraints—the speed of light, the power wall, and the memory wall—carve the deployment landscape into four distinct paradigms spanning nine orders of magnitude in power and memory. No single paradigm suffices for production systems; hybrid architectures that partition work across Cloud, Edge, Mobile, and TinyML tiers define the state of the art.

::: {.callout-takeaways title="Same Model, Different Engineering"}

* **Physical constraints are permanent**\index{physical constraints!permanent boundaries}: Speed of light (~36 ms cross-country round-trip), power wall, and memory wall create hard boundaries that engineering cannot overcome—only navigate.
* **Identify bottlenecks before optimizing**\index{bottleneck principle!optimization strategy}: The same model is compute-bound in training but memory-bound in inference. The Iron Law and Bottleneck Principle pinpoint which constraint dominates; optimizing the wrong term yields zero speedup.
* **Workload archetypes determine deployment feasibility**: A Compute Beast (ResNet-50 training) requires cloud scale; a Tiny Constraint (keyword spotting) requires microcontroller efficiency. The same optimization strategy cannot serve both—match the archetype to the paradigm.
* **The deployment spectrum spans 1,000,000$\times$ in energy**: Cloud (1 kW) to TinyML (1 mW). This gap enables entirely different application classes rather than representing a limitation.
* **Hybrid architectures are prevalent in production systems**\index{hybrid architectures!voice assistant example}: Voice assistants span TinyML (wake-word), Mobile (speech-to-text), and Cloud (language understanding). Rarely does one paradigm suffice; integration patterns (Train-Serve Split, Hierarchical Processing, Progressive Deployment) formalize how paradigms combine.
* **Latency budgets reveal feasibility**\index{latency budgets!feasibility analysis}: 100 ms round-trip to cloud eliminates real-time applications; 10 ms edge inference enables them. Apply the decision framework (@fig-mlsys-playbook-flowchart) to filter paradigms by privacy, latency, compute, and cost.
* **System-level speedup obeys Amdahl's Law, not model-level gains**\index{Amdahl's Law!system optimization}: A 10$\times$ faster model yields only 1.37$\times$ system speedup when ML accounts for 30% of the pipeline. Profile the full system before optimizing any component.
* **Universal system principles transfer across paradigms**: Data pipelines, resource management, and system architecture recur at every scale, which is why optimization ideas can migrate from cloud to edge and back again.

:::

The analytical tools developed here—the Iron Law, Bottleneck Principle, Workload Archetypes, and Lighthouse Models—recur throughout the remainder of this book. Every subsequent chapter, from data engineering to model compression to serving, operates within the deployment constraints established here. The decision framework (@fig-mlsys-playbook-flowchart) and the quantitative comparison (@tbl-big_vs_tiny) provide the reference points for those discussions. Knowing *where* to deploy is only the beginning. Every deployed model faces **System Entropy**—accuracy degradation as the world drifts from its training distribution—making the operational infrastructure for monitoring and retraining as important as the deployment decision itself.

::: {.callout-chapter-connection title="From Theory to Process"}

Understanding *where* ML systems run provides the foundation for understanding *how* to build them. @sec-ml-workflow establishes the systematic development process that guides ML systems from conception through deployment, translating the physical constraints examined here into reliable, production-ready systems.

:::

::: {.quiz-end}
:::

```{python}
#| echo: false
#| label: chapter-end
```