\documentclass[10pt,twocolumn]{article}

% Adjust line spacing for better readability
\renewcommand{\baselinestretch}{1.05}

% Reduce widows, orphans, and excessive hyphenation
\widowpenalty=10000 % Prevent single lines at top of page/column
\clubpenalty=10000 % Prevent single lines at bottom of page/column
\hyphenpenalty=300 % Discourage hyphenation (higher = fewer hyphens)
\tolerance=1000 % Allow slightly looser spacing to avoid hyphens

% Essential packages
\usepackage[T1]{fontenc}
\usepackage{mathpazo} % Palatino for main text and math
\usepackage[scaled=0.9]{helvet} % Helvetica for sans-serif
\usepackage{courier} % Courier for monospace
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{xcolor}
\usepackage{listings}
\usepackage[round,authoryear]{natbib}
\bibliographystyle{plainnat}
\usepackage{hyperref}
\usepackage{cleveref}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows,positioning,shadows,calc,backgrounds,decorations.pathreplacing}
\usepackage{pgfplots}
\pgfplotsset{compat=1.18}
\usepackage{subcaption}
\usepackage{enumitem}
\usepackage{titlesec}
\titlespacing*{\section}{0pt}{10pt plus 2pt minus 2pt}{4pt plus 1pt minus 1pt}
\titlespacing*{\subsection}{0pt}{8pt plus 2pt minus 2pt}{3pt plus 1pt minus 1pt}
\titlespacing*{\subsubsection}{0pt}{6pt plus 2pt minus 1pt}{2pt plus 1pt minus 1pt}
\usepackage{fancyhdr}
\usepackage{xspace}
\usepackage[section]{placeins}

% Allow more floats per page and prevent deferral to end
\renewcommand{\topfraction}{0.95}
\renewcommand{\bottomfraction}{0.8}
\renewcommand{\textfraction}{0.05}
\renewcommand{\floatpagefraction}{0.8}
\renewcommand{\dbltopfraction}{0.95}
\renewcommand{\dblfloatpagefraction}{0.8}
\setcounter{topnumber}{4}
\setcounter{dbltopnumber}{3}
\setcounter{bottomnumber}{4}
\setcounter{totalnumber}{8}
% Tighten float-to-text spacing to reduce white space
\setlength{\textfloatsep}{8pt plus 2pt minus 2pt}
\setlength{\floatsep}{6pt plus 2pt minus 2pt}
\setlength{\intextsep}{6pt plus 2pt minus 2pt}
\setlength{\dbltextfloatsep}{8pt plus 2pt minus 2pt}
\setlength{\dblfloatsep}{6pt plus 2pt minus 2pt}

% Tighten spacing around equations
\setlength{\abovedisplayskip}{6pt plus 2pt minus 2pt}
\setlength{\belowdisplayskip}{6pt plus 2pt minus 2pt}
\setlength{\abovedisplayshortskip}{2pt plus 1pt}
\setlength{\belowdisplayshortskip}{4pt plus 1pt minus 1pt}

% Branding: styled product name (display) vs. plain text (code)
\newcommand{\mlsysim}{\mbox{\textsc{MLSys}\,\textperiodcentered\,\textsc{im}}\xspace}
\newcommand{\MLSYSIM}{\textbf{\mlsysim}}

% Coverage matrix symbols
\newcommand{\fullmark}{\checkmark}
\newcommand{\halfmark}{$\circ$}
\newcommand{\emptymark}{--}

% Page geometry
\usepackage[
letterpaper,
top=0.75in,
bottom=1in,
left=0.75in,
right=0.75in,
columnsep=0.25in
]{geometry}

% Python code highlighting
\definecolor{codegreen}{rgb}{0,0.5,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.44,0.0,0.55}
\definecolor{codeblue}{rgb}{0.0,0.25,0.55}
\definecolor{backcolour}{rgb}{0.975,0.975,0.985}
\definecolor{rulecolour}{rgb}{0.78,0.78,0.82}

\lstset{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen}\itshape,
keywordstyle=\color{codeblue}\bfseries,
numberstyle=\scriptsize\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\fontsize{7.5}{9.5}\selectfont,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=6pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=4,
language=Python,
frame=single,
rulecolor=\color{rulecolour},
framesep=4pt,
xleftmargin=12pt,
framexleftmargin=12pt,
aboveskip=8pt,
belowskip=6pt,
abovecaptionskip=6pt,
belowcaptionskip=2pt,
morekeywords={print,from,import,as},
literate={->}{$\rightarrow$}1 {>=}{$\geq$}1 {<=}{$\leq$}1
}

\hypersetup{
colorlinks=true,
linkcolor=blue,
citecolor=blue,
urlcolor=blue,
pdftitle={MLSys·im: First-Principles Infrastructure Modeling for Machine Learning Systems},
pdfauthor={Vijay Janapa Reddi}
}

\title{
{\Huge\bfseries \mlsysim}\\[0.4em]
\Large\normalfont\itshape First-Principles Infrastructure Modeling for Machine Learning Systems
}

\author{
\fontsize{12}{15}\selectfont
Vijay Janapa Reddi\\[0.2em]
\fontsize{11}{14}\selectfont
Harvard University\\[0.6em]
\fontsize{10}{12}\selectfont
\textcolor{gray!60}{\href{https://mlsysbook.ai/mlsysim}{mlsysbook.ai/mlsysim}}
}

\date{}
\begin{document}

\maketitle
\vspace{-1.5em}
\begin{abstract}
As machine learning models transition from laboratory curiosities to critical infrastructure, the systems required to sustain them have reached a point of extreme bimodal complexity. Developers must reason about constraints spanning from sub-milliwatt microcontrollers to multi-gigawatt datacenter fleets. Existing evaluation methodologies are polarized between hardware-dependent empirical profiling and cycle-accurate simulation, leaving a void for rapid, full-stack architectural reasoning. We present \MLSYSIM{} (\textbf{M}achine \textbf{L}earning \textbf{Sys}tems \textbf{I}nfrastructure \textbf{M}odeling), a first-principles analytical modeling framework that formalizes the ``physics of systems'' into a dimensionally strict Python engine. \mlsysim introduces a 5-layer \emph{Progressive Lowering} abstraction that decouples computational demand from silicon supply and environmental context. Strict enforcement of unit-level integrity at runtime eliminates the silent conversion errors that plague ad-hoc systems modeling. We codify a complete taxonomy of 22 ``Systems Walls''---hard physical or logical constraints spanning compute ceilings, memory bandwidth, network topology, data pipelines, scaling laws, fleet reliability, economics, and sustainability---organized into six domains and resolved by a suite of 25 resolvers (20 analytical models, 2 analysis solvers, and 3 optimizers). Our evaluation demonstrates that \mlsysim enables sub-second design-space exploration, identifying binding constraints and synthesizing ideal hardware specifications across the entire ML systems lifecycle.
\end{abstract}
\section{Introduction}
\label{sec:intro}

Machine learning has become infrastructure~\citep{sutton2019bitter}. Training a frontier model now requires orchestrating tens of thousands of accelerators across datacenter fabrics where memory ceilings, network bandwidth, power delivery, and regional carbon intensities interact in non-obvious ways~\citep{dean2012large,shoeybi2019megatron}. Yet the hardware required to develop intuition for these systems is prohibitively scarce: a student cannot requisition a 100{,}000-GPU cluster to explore how topology affects AllReduce latency, and a researcher cannot easily sweep parallelism strategies across hardware generations. What is missing is a \emph{complete taxonomy} of the constraints that bind ML system performance and a framework fast enough for interactive exploration.

Consider a concrete example. A team deploying LLaMA-3 70B for interactive serving must answer: \emph{How many H100 GPUs are needed to meet a 50\,ms time-to-first-token SLA at 95th-percentile latency?} The answer depends on at least six interacting constraints---each a ``wall,'' a hard bound imposed by physics, economics, or algorithmic scaling (\Cref{tab:walls}): the model's 70 billion parameters require $\sim$140\,GB in FP16, exceeding a single GPU's 80\,GB HBM capacity (Wall~2: Memory). Tensor-parallel sharding across two GPUs introduces NVLink synchronization overhead (Wall~14: Communication). Continuous batching with PagedAttention determines KV-cache memory utilization (Wall~5: Batching). The decode phase is memory-bandwidth-bound at 3.35\,TB/s per device (Wall~4: Serving). Tail latency under load follows Erlang-C queueing dynamics (Wall~7: Tail Latency). And the fleet's total cost of ownership constrains what is economically viable (Wall~17: Capital).

\textbf{The fidelity--speed--scope void.}
Existing tools are constrained along three axes---fidelity, speed, and scope---and no tool occupies the region where all three are adequate (\Cref{tab:comparison}). Empirical profilers require physical silicon; cycle-accurate simulators like ASTRA-sim~2.0~\citep{won2023astrasim2} require hours per configuration; analytical tools like Calculon~\citep{calculon2023} achieve speed but focus narrowly on LLM training, ignoring data pipelines, reliability, sustainability, and inference. None enforce dimensional correctness, leaving practitioners vulnerable to silent unit-conversion errors~\citep{parashar2019timeloop,wu2019accelergy}. Patterson and Hennessy faced an analogous gap in computer architecture education: they gave students not cycle-accurate x86 simulators but a \emph{taxonomically complete} instruction set (MIPS) that exposed every architectural concept through a model simple enough to reason about yet faithful enough to develop correct mental models~\citep{hennessy2024architecture}. \mlsysim aspires to the same role for ML systems.

\textbf{\mlsysim.}
We present \MLSYSIM{} (Machine Learning Systems Infrastructure Modeling), an open-source, pure-Python analytical framework that formalizes back-of-the-envelope ML systems reasoning into a dimensionally strict, composable engine. The framework codifies 22 ``Systems Walls'' into 25 resolvers (20 models, 2 solvers, and 3 optimizers) organized across six domains: Node, Data, Algorithm, Fleet, Operations, and Analysis. It separates computational \emph{demand} from silicon \emph{supply} and environmental \emph{context} through a 5-layer progressive lowering architecture, enforces SI unit correctness at runtime via the \texttt{pint} library, and produces full-stack analysis in under 0.3 seconds on any laptop. Designed as the analytical companion to the \emph{Machine Learning Systems} textbook~\citep{mlsysbook2025}, \mlsysim enables students at resource-constrained institutions to engage with the same quantitative exercises as those at well-funded research universities.

This paper makes the following contributions:

\begin{enumerate}[leftmargin=*,itemsep=3pt,label=\textbf{C\arabic*.}]
\item \textbf{A Taxonomy of 22 Systems Walls} (\Cref{tab:walls}), each grounded in a published equation and resolved by a dedicated resolver (model or solver). The taxonomy provides a complete, structured vocabulary for reasoning about the constraints that bind ML system performance (\Cref{sec:taxonomy}).

\item \textbf{Demand--Supply Separation with Dimensional Strictness.} A 5-layer progressive lowering abstraction formally decouples computational demand from silicon supply. Every physical quantity carries SI units at runtime, transforming dimensional analysis from a manual discipline into a machine-checked invariant (\Cref{sec:architecture}).

\item \textbf{Composable Resolver Algebra.} 25 resolvers (20 models, 2 solvers, and 3 optimizers) compose through chaining: each is a pure function $f(\text{config}) \to \text{metrics}$, producing a three-level evaluation (Feasibility, Performance, Macro) that identifies binding constraints. The algebra includes an inverse-Roofline \emph{synthesis} solver that derives minimum hardware specifications from SLA requirements (\Cref{sec:solver-formalism,sec:usage}).

\item \textbf{Accessible Full-Stack Reasoning without Hardware.} \mlsysim runs on any laptop without GPUs, clusters, or cloud credits. It powers fully autogradable, deterministic labs and compiles directly to WebAssembly via Marimo notebooks, providing an interactive, browser-based systems engineering environment (\Cref{sec:usage}).
\end{enumerate}

The paper builds the framework in layers. We first survey the modeling landscape and identify the void (\Cref{sec:related}), then present the architecture: four design principles (\Cref{sec:philosophy}) that materialize as a 5-layer progressive lowering stack (\Cref{sec:stack}) whose type system enables the 22-wall taxonomy (\Cref{sec:taxonomy}), which in turn defines the solver algebra (\Cref{sec:solver-formalism}). We validate against six published benchmarks spanning five domains (\Cref{sec:validation}), demonstrate the framework through student, instructor, and researcher use cases (\Cref{sec:usage}), surface common misconceptions the framework is designed to expose (\Cref{sec:fallacies}), and discuss limitations and future work (\Cref{sec:discussion}).
\begin{table*}[!t]
\centering
\caption{\textbf{The 22 ML Systems Walls.} Each wall represents a physical or logical constraint resolved by a dedicated resolver (model or solver). Walls 1--2 (Compute and Memory) share the \texttt{SingleNodeModel}. Together with 3 optimizers (parallelism, batching, placement), the framework provides 25 resolvers across 22 walls. Domains progress from local node resources through data movement and algorithmic scaling to fleet coordination, operations, and cross-cutting analysis. Each wall is formalized in \Cref{sec:taxonomy}.}
\label{tab:walls}
\small
\renewcommand{\arraystretch}{1.1}
\begin{tabularx}{\textwidth}{@{}r l l l X l@{}}
\toprule
\textbf{\#} & \textbf{Wall} & \textbf{Resolver} & \textbf{Bounded} & \textbf{Core Equation} & \textbf{Ref.} \\
\midrule
\multicolumn{6}{@{}l}{\textit{Node (Single-Accelerator Resources)}} \\
1 & Compute & SingleNode & Peak FLOP/s & $T = \text{OPs} / (\text{Peak} \times \eta)$ & \citealt{williams2009roofline} \\
2 & Memory & SingleNode & HBM BW + cap. & $T = |W| / BW_{\text{HBM}}$ & \citealt{williams2009roofline} \\
3 & Software & Efficiency & Achieved MFU & $\eta = \text{FLOPS}_{\text{ach}} / \text{Peak}$ & \citealt{chowdhery2022palm} \\
4 & Serving & Serving & Prefill vs.\ dec. & $T_{\text{pf}} = 2PS/(F\eta);\; T_{\text{dec}} = |W|/BW$ & \citealt{pope2023llm} \\
5 & Batching & Cont.\ Batch & KV-cache frag. & $\text{KV} = 2LHD \lceil S/p \rceil pBb$ & \citealt{kwon2023efficient} \\
6 & Streaming & WeightStream & Injection BW & $T = \max(|W_\ell|/BW, 2P_\ell B/F\eta)$ & \citealt{lie2022cerebras} \\
7 & Tail Latency & TailLatency & P99 queueing & Erlang-C M/M/$c$ & \citealt{dean2013tail} \\
\midrule
\multicolumn{6}{@{}l}{\textit{Data (Movement \& Pipelines)}} \\
8 & Ingestion & Data & Storage I/O & $\rho = BW_{\text{demand}} / BW_{\text{supply}}$ & \citealt{mohan2021analyzing} \\
9 & Transform. & Transform. & CPU preproc. & $T = B / R_{\text{cpu}}$ & \citealt{murray2021tf} \\
10 & Locality & Topology & Bisection BW & $BW_{\text{eff}} = BW_{\text{link}} \cdot \beta / \text{osub}$ & \citealt{leiserson1985fat} \\
\midrule
\multicolumn{6}{@{}l}{\textit{Algorithm (Scaling \& Compression)}} \\
11 & Complexity & Scaling & Scaling laws & $C = 6PD;\; P^{*} = \sqrt{C/120}$ & \citealt{hoffmann2022chinchilla} \\
12 & Reasoning & Inf.\ Scaling & Inf.-time comp. & $T = K \times T_{\text{step}}$ & \citealt{snell2024scaling} \\
13 & Fidelity & Compression & Acc.--efficiency & $r = b_{\text{base}}/b;\; r = 1/(1{-}s)$ & \citealt{han2016deep} \\
\midrule
\multicolumn{6}{@{}l}{\textit{Fleet (Multi-Node Coordination)}} \\
14 & Communic. & Distributed & AllReduce & $T = 2\tfrac{N{-}1}{N}\tfrac{M}{B_{\text{link}}} + 2(N{-}1)\alpha$ & \citealt{shoeybi2019megatron} \\
15 & Fragility & Reliability & Cluster MTBF & $\text{MTBF}_{\text{cl}} = \text{MTBF}_{\text{node}}/N$ & \citealt{daly2006higher} \\
16 & Multi-tenant & Orchestration & Queue wait & $T_{\text{wait}} = \rho / [2\mu(1{-}\rho)]$ & \citealt{little1961proof} \\
\midrule
\multicolumn{6}{@{}l}{\textit{Operations (Economics, Sustainability \& Safety)}} \\
17 & Capital & Economics & TCO & $\text{TCO} = \text{CapEx} + \text{OpEx}$ & \citealt{barroso2018datacenter} \\
18 & Sustain. & Sustainability & Carbon + water & $\text{CO}_2 = E \times \text{PUE} \times \text{CI}$ & \citealt{patterson2021carbon} \\
19 & Checkpoint & Checkpoint & I/O burst penalty & $\text{penalty} = T_{\text{write}} / T_{\text{interval}}$ & \citealt{eisenman2022checknrun} \\
20 & Safety & Resp.\ Eng. & DP-SGD overhead & $\sigma \propto 1/\varepsilon$ & \citealt{abadi2016deep} \\
\midrule
\multicolumn{6}{@{}l}{\textit{Analysis (Cross-Cutting Diagnostics)}} \\
21 & Sensitivity & Sensitivity & Binding constr. & $\partial T / \partial x_i$ & \citealt{williams2009roofline} \\
22 & Synthesis & Synthesis & Inverse spec & $BW_{\text{req}} = |W| / T_{\text{target}}$ & \citealt{kwon2023efficient} \\
\bottomrule
\end{tabularx}
\end{table*}
\section{Related Work}
\label{sec:related}

Tools for modeling and evaluating ML systems span a wide spectrum of fidelity, scope, and intended audience. We organize prior work into four categories and position \mlsysim relative to each. \Cref{tab:comparison} provides a quantitative summary.
\begin{table*}[!t]
\centering
\caption{\textbf{Comparison of ML systems modeling tools.} Scope indicates the range of system aspects modeled. Speed reflects wall-clock time for a single evaluation. Walls indicates the number of the 22 systems walls (\Cref{tab:walls}) each tool addresses. Phase indicates whether training, inference, or both are modeled. Dist.\ indicates support for multi-node distributed analysis.}
\label{tab:comparison}
\small
\renewcommand{\arraystretch}{1.1}
\begin{tabularx}{\textwidth}{@{}l l X c c c c@{}}
\toprule
\textbf{Tool} & \textbf{Approach} & \textbf{Scope} & \textbf{Speed} & \textbf{Walls} & \textbf{Phase} & \textbf{Dist.} \\
\midrule
\multicolumn{7}{@{}l}{\textit{Cycle-level}} \\
gem5 & Cycle-accurate & CPU/GPU microarchitecture & Hours & 1--2 & Both & \emptymark \\
ASTRA-sim 2.0 & Cycle-accurate & Network collectives, topology & Hours & 1--2 & Train & \fullmark \\
SimAI & Trace-driven & Full-stack distributed training & Minutes & 2--3 & Train & \fullmark \\
\midrule
\multicolumn{7}{@{}l}{\textit{Accelerator design}} \\
Timeloop + Accelergy & Analytical & Accelerator dataflow \& energy & Minutes & 1--2 & Infer & \emptymark \\
LLMCompass & Analytical & LLM inference HW design space & Minutes & 2--3 & Infer & \emptymark \\
\midrule
\multicolumn{7}{@{}l}{\textit{Analytical \& co-design}} \\
Paleo & Analytical & DNN training compute \& comm. & Seconds & 2 & Train & \fullmark \\
Calculon & Analytical & LLM training performance & Seconds & 2--3 & Train & \fullmark \\
Lumos & Trace-driven & LLM training perf.\ modeling & Seconds & 2--3 & Train & \fullmark \\
Vidur & Empirical & LLM inference scheduling & Seconds & 3--4 & Infer & \emptymark \\
GenZ & Analytical & LLM inference platform design & Seconds & 2--3 & Infer & \emptymark \\
LLM-Viewer & Analytical & LLM inference memory/latency & Seconds & 1--2 & Infer & \emptymark \\
\midrule
\multicolumn{7}{@{}l}{\textit{Sustainability}} \\
LLMCarbon & Analytical & LLM carbon footprint (op.\ + embodied) & Seconds & 1 & Both & \emptymark \\
CodeCarbon & Empirical & Runtime energy \& carbon tracking & Seconds & 1 & Both & \emptymark \\
\midrule
\MLSYSIM & \textbf{Analytical} & \textbf{Full-stack: compute, memory, network, data,} & \textbf{Sub-sec.} & \textbf{22} & \textbf{Both} & \fullmark \\
& & \textbf{scaling, reliability, econ., sustainability, safety} & & & & \\
\bottomrule
\end{tabularx}
\end{table*}
\subsection{Cycle-Level Simulators}

Cycle-accurate simulators provide the highest fidelity by modeling hardware behavior at the microarchitectural level. gem5~\citep{binkert2011gem5} is the canonical general-purpose architecture simulator, capable of modeling CPUs and GPUs down to individual pipeline stages. While invaluable for processor design, gem5 lacks ML-specific abstractions (it has no notion of a transformer layer, a training step, or a parallelism strategy), and simulating even a single forward pass of a modern model can require hours of wall-clock time.

ASTRA-sim 2.0~\citep{won2023astrasim2} addresses the ML gap by providing a hierarchical network simulator purpose-built for distributed training. It models collective communication patterns such as AllReduce across realistic network topologies, producing high-fidelity estimates of communication overhead. SimAI~\citep{wang2025simai} extends this approach with a full-stack training simulator that integrates NS3-based network modeling with kernel computation traces, achieving 98\% alignment with real-world results on 1024-node A100 clusters. Both inherit the fundamental cost of high fidelity: simulating one training step of a large model at cluster scale can take minutes to hours, making iterative design-space exploration impractical. Their scope is also limited to communication and compute; neither models economics, sustainability, data pipelines, or reliability.

\mlsysim occupies a different point on the fidelity--speed spectrum. Where ASTRA-sim answers ``how many microseconds does this AllReduce take on this exact topology,'' \mlsysim answers ``which of 22 possible bottlenecks binds this system, and how does changing hardware shift the binding constraint?'' The two classes are complementary: \mlsysim narrows the design space, and high-fidelity simulators validate specific points within it.

\subsection{Accelerator Design Tools}

A second class of tools targets the design and evaluation of individual accelerator architectures. Timeloop~\citep{parashar2019timeloop} provides a systematic methodology for evaluating DNN accelerator dataflows, modeling how data tiles map onto spatial architectures and estimating latency and energy for each mapping. Accelergy~\citep{wu2019accelergy}, its companion framework, supplies the energy estimation primitives that Timeloop consumes. Together, they form a powerful toolkit for accelerator architects exploring the design space of novel silicon. LLMCompass~\citep{zhang2024llmcompass} brings this approach to LLM inference, combining an automated mapper with an area-based cost model to explore compute, memory bandwidth, and buffer configurations, achieving 4\% error for end-to-end LLM inference on A100 nodes within minutes.

These tools operate at the operator and tile level, modeling how a single convolution or matrix multiplication executes on a specific microarchitecture. They do not reason about system-level concerns: how multiple accelerators communicate across a network fabric, how the data pipeline feeds those accelerators, or what the total cost of ownership looks like at fleet scale. \mlsysim operates one abstraction level higher. It consumes the \emph{outputs} of accelerator-level analysis (peak FLOP/s, memory bandwidth, TDP) as inputs to its hardware registry and reasons about how those specifications interact with workload demands, network topologies, and infrastructure constraints.

\subsection{Analytical and Co-Design Tools}

Closest in spirit to \mlsysim are analytical tools that sacrifice microarchitectural detail for speed. Calculon~\citep{calculon2023} is an analytical co-design tool for large language model training. It models training time as a function of hardware specifications, parallelism strategies, and model architecture, achieving execution speeds comparable to \mlsysim. However, Calculon's scope is narrow by design, targeting transformer-based LLM training exclusively, with no support for CNNs, mixture-of-experts architectures, or inference workloads. It does not model data pipelines, reliability, sustainability, economics, or safety considerations, and it lacks dimensional enforcement. Lumos~\citep{liang2025lumos} takes a trace-driven approach, using profiled kernel traces to predict LLM training performance at scale with 3.3\% average error on up to 512 H100 GPUs. While Lumos achieves higher single-point accuracy than \mlsysim, it requires empirical traces from the target hardware, limiting its use in what-if exploration across hypothetical configurations. The DeepSeek-V3 systems paper~\citep{deepseek2025v3} exemplifies the kind of hardware-aware co-design analysis that \mlsysim targets: it demonstrates how FP8 mixed-precision training, MoE sparsity, and multi-plane network topology interact to achieve frontier-model training at a fraction of conventional cost, a multi-wall optimization that spans Walls 1, 3, 13, 14, and 17 in our taxonomy.

Paleo~\citep{qi2017paleo} pioneered the analytical approach, decomposing DNN training time into computation and communication components across data- and model-parallel configurations. FlexFlow~\citep{jia2019flexflow} optimizes parallelism strategies through simulation-guided search, and Habitat~\citep{yu2021habitat} provides cross-hardware extrapolation of training performance using execution-time scaling curves. While these tools advance specific aspects of analytical modeling, they predate or do not address the full scope of modern concerns (inference serving, fleet economics, sustainability). Vidur~\citep{agrawal2024vidur} extends analytical modeling to LLM \emph{inference}, using operator-level profiling to build a fine-grained runtime estimator validated at less than 5\% error across multiple LLMs and scheduling policies. GenZ~\citep{bambhaniya2024genz} provides an analytical framework for LLM inference platform design that models multi-dimensional network topologies and serving optimizations. DistServe~\citep{zhong2024distserve} and Sarathi-Serve~\citep{agrawal2024sarathi} advance LLM serving through prefill-decode disaggregation and chunked-prefill scheduling, respectively, demonstrating that the two-phase inference model (\Cref{eq:serving}) requires increasingly sophisticated scheduling to achieve high goodput. All of these inference-oriented tools focus on serving performance in isolation; they do not model training, data pipelines, economics, or sustainability.

LLM-Viewer~\citep{yuan2024llmviewer} and llm-analysis~\citep{kim2023llmanalysis} provide lightweight memory and latency estimation for transformer inference. These tools are useful for single-model profiling but do not extend to fleet-level reasoning, multi-tenant scheduling, or cross-domain constraint analysis.

A parallel line of work targets sustainability and fleet efficiency. LLMCarbon~\citep{faiz2024llmcarbon} projects end-to-end carbon footprints (operational and embodied) for dense and MoE LLMs, validated within 8\% of Google's published figures. CodeCarbon~\citep{lottick2019codecarbon} provides empirical energy tracking at runtime via hardware power monitors. \citet{wongpanich2025fleet} introduce \emph{ML Productivity Goodput} (MPG) as a fleet-level efficiency metric for warehouse-scale TPU clusters, demonstrating that traditional utilization metrics are insufficient for characterizing ML fleet performance across model, data, framework, compiler, and scheduling layers. These tools each address one or two domains and do not model the full cross-stack interactions that determine \emph{why} a workload consumes so much energy.

\mlsysim generalizes the analytical approach to the full ML systems stack. Where each tool above models one or two domains, \mlsysim composes 25 resolvers spanning compute, memory, serving, batching, streaming, tail latency, network, data pipelines, scaling laws, compression, reliability, checkpointing, economics, sustainability, and responsible engineering. Crucially, \mlsysim enforces dimensional strictness at runtime via the \texttt{pint} library, transforming unit consistency from a manual discipline into a machine-checked invariant.

\subsection{Pedagogical Precedents}

\mlsysim draws direct inspiration from the tradition of pedagogical simulators in systems education. Patterson and Hennessy's MIPS/SPIM simulator~\citep{patterson2014organization} taught generations of students computer architecture not by replicating a production processor, but by providing a simplified model that made architectural concepts tangible through rapid experimentation. Similarly, xv6~\citep{cox2011xv6} and MINIX~\citep{tanenbaum2006minix} teach operating systems by stripping away production complexity to reveal core abstractions.

\mlsysim follows this pedagogical philosophy for the ML systems domain. Production ML infrastructure (spanning millions of lines of code across frameworks, compilers, schedulers, and orchestrators) is too complex for students to reason about directly. \mlsysim provides a controlled environment in which students can sweep hardware configurations, vary parallelism strategies, and observe how binding constraints shift, all in rapid iteration cycles. Its integration with the companion textbook~\citep{mlsysbook2025} provides structured laboratory exercises with autogradable assessments, a capability absent from every research-oriented tool surveyed in \Cref{sec:related}.
\begin{figure*}[!t]
\centering
\includegraphics[width=\textwidth]{figures/mlsysim-overview.pdf}
\caption{\textbf{\mlsysim Framework Overview.} (a)~Demand--supply separation decouples workload specifications from hardware capabilities and environmental context. (b)~All 22 Systems Walls organized into six domains (Node, Data, Algorithm, Fleet, Operations, Analysis), each grounded in a published equation. (c)~Stateless resolver composition chains 25 resolvers to identify binding constraints through a three-level evaluation. (d)~Example outputs for LLaMA-3 70B on 1024$\times$ H100, produced in $<$0.3\,s on a laptop.}
\label{fig:overview}
\end{figure*}
% ============================================================
\section{Architecture}
\label{sec:architecture}

\Cref{fig:overview} presents the end-to-end framework. This section unpacks the design in two parts: four principles that constrain every subsequent decision (\Cref{sec:philosophy}), and the five-layer progressive lowering stack with a dimensionally strict type system that realizes them (\Cref{sec:stack}).

\subsection{Design Philosophy}
\label{sec:philosophy}

The central question \mlsysim addresses is: \emph{Where does the complexity of an ML system come from?} We argue that complexity arises from the non-linear interaction of constraints across six distinct domains (node resources, data movement, algorithms, fleet coordination, operations, and cross-cutting analysis) and that an effective modeling tool must formalize \emph{all six} simultaneously. Four design principles govern \mlsysim's approach.

\subsubsection{Analytical Speed over Cycle Accuracy}
\label{sec:speed}
Analytical models that execute in sub-second time enable the iterative design-space exploration that cycle-accurate simulators cannot support. By absorbing microarchitectural detail (cache hit rates, warp scheduling) into a single efficiency parameter~$\eta$, \mlsysim achieves sub-second execution per solve. This enables sweeps over thousands of hardware--model--topology combinations in seconds---the kind of rapid ``what-if'' exploration that \citet{hennessy2024architecture} identify as essential for quantitative architectural reasoning.

To maintain rigor despite analytical simplification, \mlsysim enforces a \textbf{``No Magic Numbers'' invariant}. Every hardware constant references a datasheet URL and verification date. The H100's peak FP16 throughput is not a bare \texttt{989} floating-point literal; it is \texttt{989\,*\,TFLOPs\,/\,second}, sourced from NVIDIA's published datasheet~\citep{nvidia2023h100}. This provenance discipline ensures that analytical speed does not come at the cost of reproducibility.

\subsubsection{Dimensional Strictness as an Invariant}
\label{sec:dimensional}

Dimensional consistency is a pervasive challenge in systems modeling. Mixing gigabits with gigabytes, or omitting the refractive index of fiber in latency calculations, are canonical failure modes that silently corrupt results. \mlsysim treats dimensional correctness not as a feature but as a \textbf{runtime invariant}, analogous to memory safety in Rust. The framework wraps every physical quantity using the \texttt{pint} unit library. If a user attempts to add FLOP/s to GB/s, the framework raises a deterministic \texttt{DimensionalityError} before any computation proceeds:

\begin{lstlisting}[caption={\textbf{Dimensional Strictness.} Prevents silent unit errors at the API level.},label={lst:units}]
from mlsysim.core.constants import Q_
rate = Q_("989 TFLOPs/s") # Compute throughput
bw = Q_("3.35 TB/s") # Memory bandwidth
rate + bw # -> DimensionalityError
(rate / bw).to("flop/byte") # -> 295 flop/byte (ridge point)
\end{lstlisting}

This design eliminates the single most common class of bugs in back-of-the-envelope systems analysis: silent unit conversion errors.\footnote{The most infamous unit-conversion failure is the Mars Climate Orbiter, lost in 1999 because ground software produced thrust data in pound-force seconds while the spacecraft expected newton seconds~\citep{stephenson1999mco}.} Every constant in \texttt{constants.py} carries dimensioned units through all downstream computations (e.g., \texttt{H100\_MEM\_BW = 3.35\,*\,TB/s}), preventing unit mismatches at the type level.

\subsubsection{Taxonomic Completeness}
\label{sec:taxonomic}

We define a modeling framework as ``complete'' only when every fundamental bottleneck to scaling has a mathematical resolver. \mlsysim codifies 22 such bottlenecks, which we call \emph{Systems Walls}, organized into six domains: Node (single-accelerator resources), Data (movement and pipelines), Algorithm (scaling and compression), Fleet (multi-node coordination), Operations (economics, sustainability, and safety), and Analysis (cross-cutting diagnostics). Each wall maps to a dedicated resolver with a formal equation grounded in the systems literature. \Cref{sec:taxonomy} presents the full taxonomy with formal definitions, and \Cref{sec:solver-formalism} describes how solvers compose.

\subsubsection{Demand--Supply Separation}
\label{sec:demand-supply}

\mlsysim enforces a strict separation between \emph{what} a model computes and \emph{where} it runs. A \texttt{TransformerWorkload} describes computational demand (parameters, layers, FLOPs, arithmetic intensity) without reference to any specific accelerator. A \texttt{HardwareNode} describes physical supply (peak throughput, memory bandwidth, TDP) without reference to any specific model. This decoupling, inspired by the compiler IR philosophy of progressive lowering~\citep{hennessy2024architecture}, enables hardware--software co-design: the same GPT-3 workload can be evaluated against an H100, a TPU~v5p, or a hypothetical future accelerator in a single parametric sweep, with all dimensional conversions handled automatically.
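The separation is visible in ordinary code. The sketch below evaluates one workload against several supply points; \texttt{Hardware.Cloud.H100}, \texttt{Hardware.Cloud.CerebrasCS3}, \texttt{lower()}, and \texttt{ridge\_point()} follow the API described in \Cref{sec:stack}, while the import paths, the workload constructor arguments, and the \texttt{arithmetic\_intensity} attribute name are illustrative assumptions.

\begin{lstlisting}[caption={\textbf{Demand--Supply Sweep (sketch).} One hardware-agnostic workload evaluated against multiple supply points; import paths and attribute names are assumptions for illustration.},label={lst:sweep}]
# Sketch: one demand, many supplies. Import paths and the
# attribute names on the lowered graph are assumptions.
from mlsysim.workloads import TransformerWorkload
from mlsysim.hardware import Hardware

gpt3 = TransformerWorkload(parameters=175e9)  # demand only
graph = gpt3.lower()  # hardware-agnostic ComputationGraph

for node in [Hardware.Cloud.H100, Hardware.Cloud.CerebrasCS3]:
    # Memory-bound if intensity sits below the ridge point.
    if graph.arithmetic_intensity < node.ridge_point():
        print(node.name, "memory-bound")
    else:
        print(node.name, "compute-bound")
\end{lstlisting}
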
\subsection{The Progressive Lowering Stack}
\label{sec:stack}

\mlsysim implements these design principles through a five-layer \emph{Progressive Lowering} stack (\Cref{fig:stack}). Layers A--D are independent input layers that describe demand, supply, context, and topology respectively; they do not depend on one another. Layer~E (Resolvers) consumes any combination of layers A--D as needed---a single-node analysis requires only A+B, while a fleet-wide carbon estimate draws on A+B+C+D.

\begin{figure*}[!t]
\centering
\includegraphics[width=\textwidth]{images/pdf/architecture-stack.pdf}
\caption{\textbf{The \mlsysim 5-Layer Architecture.} Layers A--D provide typed inputs: workload demand, hardware supply, infrastructure context, and network topology. Layer~A lowers through \texttt{lower()} to a hardware-agnostic Computation Graph. Layer~E's 25 stateless resolvers consume these inputs and produce a three-level SystemEvaluation scorecard. The red arrow marks the binding constraint identified by the solver chain.}
\label{fig:stack}
\end{figure*}

\textbf{Layer A: Workloads (Demand).} A workload is a hardware-agnostic description of computational demand. \mlsysim provides five concrete workload types (\Cref{tab:workloads}), each exposing a \texttt{lower()} method that produces a \texttt{ComputationGraph}: an intermediate representation containing total operations, weight bytes, and arithmetic intensity in \texttt{flop/byte}. This IR is the contract between demand and supply: it captures \emph{what} must be computed without prescribing \emph{how}.

\begin{table}[!t]
\centering
\caption{\textbf{Supported Workload Types.} Each workload lowers to a \texttt{ComputationGraph} with total FLOPs, weight bytes, and arithmetic intensity.}
\label{tab:workloads}
\small
\renewcommand{\arraystretch}{1.1}
\begin{tabularx}{\columnwidth}{@{}l X l@{}}
\toprule
\textbf{Workload} & \textbf{Key Parameters} & \textbf{Scaling} \\
\midrule
Transformer & $P$, $L$, $H$, $D$, seq.\ length & $2P$ FLOPs/token \\
CNN & $P$, inference FLOPs & Fixed per image \\
Sparse (MoE) & Total vs.\ active $P$, experts & Active $P$ for FLOPs \\
SSM (Mamba) & $P$, state dim, $D$ & $O(1)$ state cache \\
Diffusion & $P$, denoising steps $T$ & $T \times$ FLOPs/step \\
\bottomrule
\end{tabularx}
\end{table}

\textbf{Layer B: Hardware (Supply).} A \texttt{HardwareNode} composes four subsystems: \texttt{ComputeCore} (peak FLOP/s with a precision-keyed dictionary for FP16, TF32, FP8, INT8), \texttt{MemoryHierarchy} (capacity and bandwidth), optional \texttt{StorageHierarchy}, and optional \texttt{IOInterconnect}. Each node also carries TDP, unit cost, and a kernel dispatch tax. The \texttt{ridge\_point()} method computes the Roofline inflection $R = F_{\text{peak}} / BW_{\text{mem}}$ in \texttt{flop/byte}~\citep{williams2009roofline}, enabling immediate classification of any lowered workload as compute-bound or memory-bound.

\textbf{Layer C: Infrastructure (Context).} \texttt{GridProfile} objects encode regional environmental parameters: carbon intensity (gCO$_2$/kWh), Power Usage Effectiveness (PUE), and Water Usage Effectiveness (WUE). A \texttt{Datacenter} composes a grid profile with rack-level power density constraints. This layer converts raw energy consumption into carbon footprint and water usage, following the methodology of \citet{patterson2021carbon}.

\textbf{Layer D: Systems (Topology).} A \texttt{Fleet} composes \texttt{Node}s (accelerator type, count per node, intra-node bandwidth) with a \texttt{NetworkFabric} (topology, inter-node bandwidth, latency, oversubscription ratio).\footnote{Throughout this paper, \emph{node} refers to a physical compute node---a server containing one or more accelerators connected via PCIe or NVLink---not a vertex in a graph or network topology.} This layer enables distributed analysis: the \texttt{DistributedModel} decomposes workloads using 4D parallelism (data-parallel~DP $\times$ tensor-parallel~TP $\times$ pipeline-parallel~PP $\times$ expert-parallel~EP) and calculates hierarchical AllReduce costs, pipeline bubble fractions, and scaling efficiency.

\textbf{Layer E: Resolvers (Analysis).} Twenty-five stateless resolvers (20 models, 2 solvers, and 3 optimizers) consume demand and supply to produce dimensioned performance metrics. The solver formalism (stateless composition, chaining semantics, and three-level evaluation) is detailed in \Cref{sec:solver-formalism}.

Reproducible analysis requires reproducible inputs. \mlsysim organizes vetted specifications into four curated registries collectively called the \emph{MLSys Zoo}: the \textbf{Silicon Zoo} (hardware accelerators), the \textbf{Model Zoo} (workload architectures), the \textbf{Fleet Zoo} (cluster topologies), and the \textbf{Infrastructure Zoo} (regional grid profiles). Each entry is a fully typed object whose constants are sourced from datasheets with provenance metadata including source URLs and verification dates. For example, \texttt{Hardware.Cloud.H100} returns a \texttt{HardwareNode} with all physical quantities dimensioned via \texttt{pint}. The Silicon Zoo spans six orders of magnitude from sub-milliwatt microcontrollers (\texttt{Hardware.Tiny.ESP32\_S3}, 512\,KiB SRAM) to wafer-scale engines (\texttt{Hardware.Cloud.CerebrasCS3}, 44\,GB on-wafer SRAM, 125\,PFLOP/s~\citep{lie2022cerebras}). The \texttt{list(sort\_by=)} class method enables programmatic comparison, and users extend any zoo by instantiating new typed objects (\Cref{lst:hwnode}), ensuring custom entries participate in the same dimensionally strict pipeline as vetted ones.
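A brief usage sketch follows; the registry entries are the vetted objects named above, while the import path, the \texttt{sort\_by} key string, and the attribute spelling on returned entries are illustrative assumptions.

\begin{lstlisting}[caption={\textbf{Querying the Silicon Zoo (sketch).} Registry entries are dimensioned, typed objects; the import path and \texttt{sort\_by} key shown here are assumptions.},label={lst:zoo}]
# Sketch: programmatic zoo queries (import path assumed).
from mlsysim.hardware import Hardware

h100 = Hardware.Cloud.H100   # vetted, dimensioned entry
print(h100.ridge_point())    # -> ~295 flop/byte

# Rank vetted cloud parts by memory bandwidth (key assumed).
for node in Hardware.Cloud.list(sort_by="memory_bandwidth"):
    print(node.name, node.memory.bandwidth)
\end{lstlisting}
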
\subsection{The Type System}
\label{sec:types}

\mlsysim's type system is built on Pydantic \texttt{BaseModel} classes with \texttt{pint} \texttt{Quantity} fields, providing both schema validation and dimensional enforcement at construction time. The composition hierarchy is deliberately shallow: \texttt{HardwareNode} aggregates \texttt{ComputeCore} and \texttt{MemoryHierarchy} as direct fields, not through deep inheritance. This design makes the relationship between a hardware specification and its physical quantities immediately legible:

\begin{lstlisting}[caption={\textbf{Custom Hardware Node.} Composing a hardware specification with dimensional types.},label={lst:hwnode}]
from mlsysim.hardware.types import *
from mlsysim.core.constants import Q_
node = HardwareNode(
    name="Custom Accelerator",
    release_year=2025,
    compute=ComputeCore(
        peak_flops=Q_("500 TFLOPs/s"),
        precision_flops={"fp8": Q_("1000 TFLOPs/s")}),
    memory=MemoryHierarchy(
        capacity=Q_("96 GiB"),
        bandwidth=Q_("4 TB/s")),
    tdp=Q_("500 W"))
print(node.ridge_point()) # -> 125 flop/byte
\end{lstlisting}

The \texttt{ComputationGraph} IR bridges the demand--supply gap. When a solver calls \texttt{workload.lower()}, the workload computes its total operations, weight bytes, and arithmetic intensity, all in dimensioned quantities. For Mixture-of-Experts models, \texttt{SparseTransformerWorkload.lower()} uses \emph{active} parameters for FLOPs but \emph{total} parameters for memory footprint, correctly modeling the fundamental decoupling between compute cost and capacity requirements in sparse architectures~\citep{shazeer2017outrageously}.

The complete evaluation produces a \texttt{SystemEvaluation} scorecard, a single object containing every metric from every resolver, cross-referenced by wall number. Students can inspect any individual wall or view the aggregate to understand how constraints interact across the full stack.

\subsection{Extensibility}
\label{sec:extensibility}
The layered architecture is designed for extension at every level. New workload types (e.g., a \texttt{RetrievalAugmentedWorkload} for RAG pipelines) require only implementing the \texttt{lower()} method to produce a \texttt{ComputationGraph}; all existing resolvers then apply without modification. New hardware entries are added to the Silicon Zoo as declarative \texttt{HardwareNode} specifications (\Cref{lst:hwnode}), with no resolver changes needed. New resolvers can be introduced for emerging constraints by implementing the appropriate tier interface: a Tier 1 Model (e.g., a \texttt{PrivacyModel} for federated learning overhead), a Tier 2 Solver, or a Tier 3 Optimizer. Because resolvers accept typed inputs and return dimensioned outputs, the type system enforces correctness at every boundary, so custom extensions compose safely with existing components and \mlsysim can track the rapidly evolving ML systems landscape without architectural changes to the core framework.
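As an extension sketch, the listing below outlines the hypothetical \texttt{RetrievalAugmentedWorkload} mentioned above. Only the \texttt{lower()} contract is prescribed by the framework; the \texttt{ComputationGraph} constructor fields shown are assumptions.

\begin{lstlisting}[caption={\textbf{Extending Layer A (sketch).} A hypothetical RAG workload implementing \texttt{lower()}; the \texttt{ComputationGraph} field names are assumptions.},label={lst:extend}]
# Sketch: a custom workload type. Only lower() is required;
# the ComputationGraph field names below are assumptions.
from mlsysim.core.constants import Q_
from mlsysim.workloads.types import ComputationGraph

class RetrievalAugmentedWorkload:
    """Retrieval index scan plus generator forward pass."""
    def __init__(self, params, index_bytes, tokens):
        self.p, self.ix, self.s = params, index_bytes, tokens

    def lower(self):
        ops = 2 * self.p * self.s   # 2P FLOPs per token
        rd = 2 * self.p + self.ix   # FP16 weights + index scan
        return ComputationGraph(
            total_ops=Q_(f"{ops} flop"),
            weight_bytes=Q_(f"{rd} byte"),
            arithmetic_intensity=Q_(f"{ops / rd} flop/byte"))
\end{lstlisting}
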
% ============================================================
\section{Taxonomy of ML Systems Walls}
\label{sec:taxonomy}

\begin{table}[!t]
\centering
\caption{\textbf{Notation.} Symbols used throughout; all quantities carry SI units at runtime via \texttt{pint}.}
\label{tab:notation}
\footnotesize
\setlength{\tabcolsep}{4pt}
\renewcommand{\arraystretch}{1.05}
\begin{tabularx}{\columnwidth}{@{}l l X@{}}
\toprule
\textbf{Symbol} & \textbf{Unit} & \textbf{Description} \\
\midrule
\multicolumn{3}{@{}l}{\textit{Model \& Workload}} \\
$P$, $P_{\ell}$ & params & Total / per-layer parameter count \\
$|W|$, $|W_{\ell}|$ & bytes & Bytes read per step / per layer \\
$b_{\text{prec}}$ & B/param & Precision (e.g., 2 for FP16) \\
$L$, $H$, $D$ & -- & Layers, attention heads, head dim \\
$S$ & tokens & Sequence length \\
$B$ & samples & Batch size \\
$K$ & -- & Reasoning steps \\
$C$ & FLOPs & Training compute ($6PD$) \\
$I$ & FLOP/B & Arithmetic intensity \\
\midrule
\multicolumn{3}{@{}l}{\textit{Hardware \& Infrastructure}} \\
$\text{Peak}_{\text{FLOPS}}$ & FLOP/s & Peak accelerator throughput \\
$BW_{\text{HBM}}$ & B/s & HBM bandwidth \\
$BW_{\text{inject}}$ & B/s & Injection BW (wafer-scale) \\
$BW_{\text{link}}$ & B/s & Per-link network bandwidth \\
$N$, $G$ & -- & Nodes in fleet, GPUs per node \\
$\alpha$ & s & Per-hop network latency \\
\midrule
\multicolumn{3}{@{}l}{\textit{Efficiency \& Utilization}} \\
$\eta$ & -- & HW utilization ($\approx$ MFU) \\
$\eta_{\text{overlap}}$ & -- & Compute--comm overlap \\
$\rho$ & -- & Utilization ratio (queue/data) \\
$\beta$, $\beta_{\text{opt}}$ & -- & Bisection BW frac, optimizer multiplier \\
\midrule
\multicolumn{3}{@{}l}{\textit{Parallelism}} \\
TP, PP, DP, EP & -- & Tensor, pipeline, data, expert parallel \\
$V$, $M_{\text{micro}}$ & -- & Virtual stages, microbatches \\
\midrule
\multicolumn{3}{@{}l}{\textit{Sustainability}} \\
PUE, WUE & --, L/kWh & Power / Water Usage Effectiveness \\
$\text{CI}$ & gCO$_2$/kWh & Regional carbon intensity \\
\midrule
\multicolumn{3}{@{}l}{\textit{Key Derived Quantities}} \\
$I^{*}$ & FLOP/B & Ridge point ($\text{Peak}/BW_{\text{HBM}}$) \\
$B^{*}$, $P^{*}$, $D^{*}$ & varies & Optimal batch, model, dataset size \\
$\tau_{\text{opt}}$ & s & Optimal checkpoint interval \\
\bottomrule
\end{tabularx}
\end{table}
The ML systems literature is rich with specialized models for specific bottlenecks, from the original Roofline model~\citep{williams2009roofline} and Chinchilla scaling laws~\citep{hoffmann2022chinchilla} to PagedAttention batching limits~\citep{kwon2023efficient} and datacenter sustainability accounting~\citep{patterson2021carbon}. However, these constraints are typically studied in isolation. We synthesize this disjointed literature into a unified taxonomy of 22 ``Walls'': 20 distinct physical or logical constraints plus 2 cross-cutting diagnostic tools (Sensitivity and Synthesis). We borrow the term from the computer architecture tradition---the ``memory wall''~\citep{williams2009roofline}, the ``power wall''~\citep{hennessy2024architecture}---where a \emph{wall} denotes a hard physical or logical constraint that bounds system performance and cannot be circumvented by software optimization alone. \mlsysim codifies these previously disparate equations into a single, composable framework: each wall is resolved by a dedicated resolver (model or solver) that accepts typed inputs and produces dimensionally correct bounds. To our knowledge, no prior framework integrates these theoretical and empirical constraints into a unified executable engine. The complete taxonomy is summarized in \Cref{tab:walls}; the subsections that follow formalize each domain.
\subsection{Node (Single-Accelerator Resources)}
\label{sec:walls-node}

The Node walls define what a single accelerator can achieve in isolation. They are the innermost constraints and the first a practitioner should evaluate.

\textbf{Wall~1: The Compute Wall.} Every accelerator has a hard throughput ceiling determined by the number of arithmetic units and the clock frequency. An H100, for example, provides a peak of 989\,TFLOP/s at FP16 with Tensor Cores, establishing an upper bound that no software optimization can exceed. The \texttt{SingleNodeModel} resolves this wall via Roofline analysis~\citep{williams2009roofline}:
\begin{equation}
\label{eq:tcompute}
T_{\text{compute}} = \frac{\text{OPs}}{\text{Peak}_{\text{FLOPS}} \times \eta}
\end{equation}
where $\eta \in (0,1]$ is the hardware utilization efficiency and OPs is the total operation count. Throughout this paper, $\eta$ denotes the ratio of sustained to peak throughput; the related metric \emph{Model FLOPS Utilization} (MFU) measures only model-useful FLOPs and excludes overhead such as activation recomputation. For first-order analysis, we treat $\eta \approx \text{MFU}$; the distinction matters only when recomputation or non-model compute is significant. When this wall binds, the only remedy is faster silicon or fewer operations. \textbf{Assumptions:} Peak FLOPS is a hard ceiling; $\eta$ is workload-dependent and must be specified or estimated from benchmark data.

\textbf{Wall~2: The Memory Wall.} High-bandwidth memory (HBM) imposes two ceilings: capacity (the model must fit) and bandwidth (weights must stream to compute units fast enough). An H100 reads HBM at 3.35\,TB/s, yet feeding its 989\,TFLOP/s of compute demands data at a rate that exceeds this bandwidth for any workload below ${\sim}295$\,flop/byte, making most LLM inference memory-bound, not compute-bound. During training, techniques like \textbf{Low-Rank Adaptation (LoRA)} and \textbf{Activation Recomputation} fundamentally alter the capacity constraint by trading compute or parameter trainability for drastically reduced memory footprints. The same \texttt{SingleNodeModel} computes~\citep{williams2009roofline}:
\begin{equation}
\label{eq:tmemory}
T_{\text{memory}} = \frac{|W|}{BW_{\text{HBM}}}
\end{equation}
where $|W|$ is the total bytes read per inference step. The realized execution time is the maximum of the two bounds:
\begin{equation}
\label{eq:bottleneck}
T = \max(T_{\text{compute}},\; T_{\text{memory}})
\end{equation}
The crossover between compute-bound and memory-bound regimes occurs at the \emph{ridge point}, the arithmetic intensity at which the two ceilings intersect:
\begin{equation}
\label{eq:ridge}
I^{*} = \frac{\text{Peak}_{\text{FLOPS}}}{BW_{\text{HBM}}} \quad \text{(flop/byte)}
\end{equation}
Workloads with arithmetic intensity $I < I^{*}$ are memory-bound; those with $I > I^{*}$ are compute-bound. \textbf{Assumptions:} Peak FLOPS and HBM bandwidth are hard ceilings; MFU accounts for software inefficiency via $\eta$.
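The Roofline bounds above reduce to a few lines of dimensioned arithmetic. The sketch below bounds a single FP16 decode step of a 70B-parameter model on an H100, using the constants from the text and the \texttt{Q\_} constructor of \Cref{lst:units}; the utilization value is an assumption.

\begin{lstlisting}[caption={\textbf{Walls 1--2 Worked Example (sketch).} Roofline bounds for one FP16 decode step of a 70B-parameter model on an H100; the utilization value is an assumed 0.5.},label={lst:roofline}]
# Sketch: Roofline bounds, one FP16 decode step, 70B params.
from mlsysim.core.constants import Q_

peak = Q_("989 TFLOPs/s")  # Wall 1 ceiling (H100, FP16)
bw = Q_("3.35 TB/s")       # Wall 2 ceiling (HBM)
eta = 0.5                  # assumed utilization

ops = Q_("140 GFLOPs")     # ~2 FLOPs/param x 70e9 params
rd = Q_("140 GB")          # FP16 weights read once per token

t_c = ops / (peak * eta)          # Eq. (1): ~0.28 ms
t_m = rd / bw                     # Eq. (2): ~41.8 ms
print(max(t_c, t_m).to("ms"))     # Eq. (3): memory-bound
print((ops / rd).to("flop/byte")) # I = 1 << I* = 295
\end{lstlisting}
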
\textbf{Wall~3: The Software Wall.} The gap between peak and achieved FLOP/s is typically larger than the gap between hardware generations. Most na\"ive implementations achieve only 30\% of peak throughput; the remaining 70\% is lost to redundant memory traffic, low warp occupancy, and unfused operations. The \texttt{EfficiencyModel} models this as a multiplicative efficiency factor~\citep{chowdhery2022palm}:
\begin{equation}
\label{eq:efficiency}
\eta = \frac{\text{FLOPS}_{\text{achieved}}}{\text{Peak}_{\text{FLOPS}}}
\end{equation}
where $\eta \in (0,1]$ modulates the Roofline ceiling, reducing the effective peak from $\text{Peak}_{\text{FLOPS}}$ to $\eta \times \text{Peak}_{\text{FLOPS}}$. FlashAttention~\citep{dao2022flashattention}, for example, achieves a $2.5\times$ speedup over standard attention by fusing memory-bound operations into a single kernel pass, effectively raising $\eta$ from ${\sim}0.3$ to ${\sim}0.75$ for attention layers. When this wall binds, better kernels, not bigger chips, are the remedy. \textbf{Assumption:} $\eta$ is a single scalar that aggregates all software inefficiencies; in practice, different operations (GEMM vs.\ attention vs.\ normalization) achieve different utilization on the same silicon. To avoid circularity when $\eta$ is unknown, the \texttt{EfficiencyModel} provides default ranges derived from published benchmarks: $\eta \approx 0.30$--$0.45$ for large-scale training~\citep{chowdhery2022palm,llama3team2024}, $\eta \approx 0.50$--$0.60$ for highly optimized GEMM-heavy workloads, and $\eta < 0.10$ for memory-bound inference decode. Students can use these defaults as starting points and refine as they gather profiling data.

\textbf{Wall~4: The Serving Wall.} Autoregressive LLM inference exhibits two distinct phases with fundamentally different Roofline characteristics~\citep{pope2023llm}. The \texttt{ServingModel} decomposes end-to-end inference latency as:
\begin{align}
\label{eq:serving}
T_{\text{prefill}} &= \frac{2P \cdot S_{\text{in}}}{\text{Peak}_{\text{FLOPS}} \times \eta} \quad \text{(compute-bound)} \\
T_{\text{decode}} &= \frac{|W|}{BW_{\text{HBM}}} \quad \text{(memory-bound)}
\end{align}
where $S_{\text{in}}$ is the input sequence length, $P$ is the parameter count, and $|W|$ includes both model weights and KV-cache reads (which grow with batch size and context length: $|W| = |W_{\text{model}}| + |W_{\text{KV}}|$). The prefill phase processes all input tokens in parallel and is compute-bound; the decode phase generates one token at a time and is memory-bandwidth-bound. The solver incorporates modern paradigms including \textbf{Prompt Caching} (prefix caching~\citep{zheng2024sglang}, which reduces TTFT by skipping prefill for previously computed KV-cache entries), \textbf{Speculative Decoding} (probability-weighted verification using a smaller draft model~\citep{leviathan2023fast}), and \textbf{Disaggregated Serving} (phase splitting onto different hardware with KV-cache network transfer~\citep{patel2024splitwise}). This duality explains why batching strategies that improve prefill throughput may have no effect on decode latency, because the two phases are bound by different resources. \textbf{Assumptions:} Prefill is compute-bound for sequence lengths $S_{\text{in}} \gg 1$; decode is memory-bandwidth-bound at batch size 1. At large batch sizes, decode transitions toward compute-bound; the solver models this crossover via the Roofline.
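A worked instance makes the asymmetry concrete. For an 8B-parameter model served in FP16 on a single H100 (chosen so the weights fit in HBM), the sketch below bounds both phases; the utilization value is an assumption, and KV-cache reads are ignored for brevity.

\begin{lstlisting}[caption={\textbf{Wall 4 Worked Example (sketch).} Prefill vs.\ decode bounds for an 8B-parameter FP16 model on one H100; utilization is assumed and KV reads are ignored.},label={lst:servingex}]
# Sketch: two-phase serving bounds, 8B params, FP16, one H100.
from mlsysim.core.constants import Q_

peak, bw, eta = Q_("989 TFLOPs/s"), Q_("3.35 TB/s"), 0.5
p, s_in = 8e9, 2048   # parameters, prompt tokens
w = Q_("16 GB")       # FP16 weights

t_prefill = Q_(f"{2 * p * s_in} flop") / (peak * eta)
t_decode = w / bw     # per generated token

print(t_prefill.to("ms")) # ~66 ms TTFT floor (compute-bound)
print(t_decode.to("ms"))  # ~4.8 ms/token, ~209 tok/s ceiling
\end{lstlisting}
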
\textbf{Wall~5: The Batching Wall.} Static batching wastes memory through external fragmentation: each request reserves a contiguous KV-cache block sized for maximum sequence length, even if most requests finish early. The \texttt{ContinuousBatchingModel} models iteration-level scheduling with non-contiguous allocation via PagedAttention~\citep{kwon2023efficient}:
\begin{equation}
\label{eq:pagedkv}
\text{KV}_{\text{paged}} = 2 \times L \times H \times D \times \lceil S / p \rceil \times p \times B \times b
\end{equation}
where $L$ is layers, $H$ is KV heads, $D$ is head dimension, $S$ is sequence length, $p$ is page size in tokens, $B$ is batch size, and $b$ is bytes per element. Internal fragmentation is bounded by the last page, eliminating the 40--50\% external fragmentation of contiguous allocation. \textbf{Assumptions:} Decode is memory-bandwidth-bound for batch $\geq 1$; static batching baseline assumes 50\% fragmentation waste.
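To give \Cref{eq:pagedkv} scale: for a representative 70B-class GQA configuration ($L{=}80$, $H{=}8$ KV heads, $D{=}128$), batch 16 at an 8K context already consumes tens of gigabytes, as the sketch below shows.

\begin{lstlisting}[caption={\textbf{Wall 5 Worked Example (sketch).} Paged KV-cache footprint from \Cref{eq:pagedkv} for a representative 70B-class GQA configuration.},label={lst:pagedkvex}]
# Sketch: paged KV-cache footprint (eq:pagedkv), 70B-class GQA.
import math
from mlsysim.core.constants import Q_

L, H, D = 80, 8, 128  # layers, KV heads, head dimension
S, p = 8192, 16       # sequence length, page size (tokens)
B, b = 16, 2          # batch size, bytes/element (FP16)

kv = 2 * L * H * D * math.ceil(S / p) * p * B * b
print(Q_(f"{kv} byte").to("GB")) # ~43 GB, last-page waste only
\end{lstlisting}
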
|
|
|
|
\textbf{Wall~6: The Streaming Wall.} Wafer-scale architectures~\citep{lie2022cerebras} (e.g., Cerebras CS-3) invert the conventional memory hierarchy: activations reside on-wafer in SRAM while model weights stream from external MemoryX nodes, shifting the bottleneck from HBM bandwidth to injection interconnect bandwidth. The \texttt{WeightStreamingModel} models this as:
|
|
\begin{equation}
|
|
\label{eq:weightstream}
|
|
T_{\text{layer}} = \max\!\left(\frac{|W_{\ell}|}{BW_{\text{inject}}},\; \frac{2P_{\ell} \times B}{\text{Peak} \times \eta}\right)
|
|
\end{equation}
|
|
where $|W_{\ell}|$ is the layer weight size in bytes, $P_{\ell} = |W_{\ell}| / b_{\text{prec}}$ is the parameter count (with $b_{\text{prec}}$ bytes per element), and the factor of 2 accounts for the multiply-accumulate FLOPs per parameter. The two terms inside the $\max$ represent the injection time (weight delivery) and the compute time (matrix arithmetic) for one layer. When $B$ is small, weight injection dominates and the compute engine sits idle; when $B$ is large, compute dominates and the injection link sits idle. Setting the two terms equal and substituting $|W_{\ell}| = P_{\ell} \times b_{\text{prec}}$ yields the optimal batch size $B^{*} = (b_{\text{prec}} \times \text{Peak} \times \eta) / (2 \times BW_{\text{inject}})$---a result that depends only on numerical precision, peak compute, and injection bandwidth, independent of layer size. This is the unique operating point where injection and compute perfectly overlap, maximizing utilization of both resources. \textbf{Assumptions:} Layer weights dominate the injection payload; 10\% overhead is reserved for working memory; perfect within-layer pipelining is assumed.
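The closed form for $B^{*}$ is easy to check by hand. \Cref{lst:bstar-sketch} does so with assumed WSE-3-like constants (FP16, 125\,PFLOP/s peak, $\eta = 0.4$, 1.2\,TB/s injection bandwidth), chosen to match Case Study~R1 in \Cref{sec:usage}.

\begin{lstlisting}[caption={\textbf{Streaming Wall Sketch.} Optimal batch where injection and compute overlap.},label={lst:bstar-sketch}]
def optimal_stream_batch(b_prec, peak, eta, bw_inject):
    """B* where weight injection time equals compute time per layer."""
    return (b_prec * peak * eta) / (2 * bw_inject)

# Assumed WSE-3-like constants: FP16, 125 PFLOP/s, eta=0.4, 1.2 TB/s MemoryX
B_star = optimal_stream_batch(b_prec=2, peak=125e15, eta=0.4, bw_inject=1.2e12)
print(f"B* ~ {B_star:,.0f}")   # ~41,700, as in Case Study R1
\end{lstlisting}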
|
|
|
|
\textbf{Wall~7: The Tail Latency Wall.} At scale, P99 latency governs user experience, not the median. A single slow replica in a fan-out of 100 services dominates end-to-end response time. The \texttt{TailLatencyModel} models inference replicas as an M/M/$c$ queue using the Erlang-C formula~\citep{dean2013tail}:
|
|
\begin{equation}
|
|
\label{eq:erlangc}
|
|
\mathbb{P}[\text{wait}] = \frac{(c\rho)^c / c! \cdot (1-\rho)^{-1}}{\sum_{k=0}^{c-1}(c\rho)^k/k! + (c\rho)^c/c! \cdot (1-\rho)^{-1}}
|
|
\end{equation}
|
|
where $c$ is the number of replicas, $\rho = \lambda / (c\mu)$ is per-server utilization, and $\lambda$, $\mu$ are arrival and service rates. P99 latency grows non-linearly as $\rho \to 1$, making the distinction between 80\% and 95\% utilization the difference between stable and catastrophic tail behavior. \textbf{Assumption:} The M/M/$c$ model is intentionally optimistic; real traffic is bursty and service times are heavy-tailed, so M/M/$c$ provides a lower bound on tail latency. If the system is unstable under M/M/$c$, it will certainly be unstable under real traffic.
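\Cref{eq:erlangc} is directly computable. \Cref{lst:erlangc-sketch} is a standalone implementation (not the \texttt{TailLatencyModel} API) that shows how sharply the queueing probability rises with utilization for an assumed 16-replica pool.

\begin{lstlisting}[caption={\textbf{Erlang-C Sketch.} Wait probability vs.\ utilization.},label={lst:erlangc-sketch}]
from math import factorial

def erlang_c_wait_prob(c, rho):
    """P[wait] for an M/M/c queue at per-server utilization rho < 1."""
    a = c * rho                                   # offered load
    tail = (a**c / factorial(c)) / (1 - rho)
    head = sum(a**k / factorial(k) for k in range(c))
    return tail / (head + tail)

for rho in (0.80, 0.90, 0.95):                    # assumed 16 replicas
    print(f"rho={rho:.2f}: P[wait]={erlang_c_wait_prob(16, rho):.3f}")
\end{lstlisting}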
|
|
|
|
\subsection{Data (Movement \& Pipelines)}
|
|
\label{sec:walls-data}
|
|
|
|
The Data walls govern how data moves \emph{to} the accelerator. A node that is locally unconstrained can still starve if the data pipeline cannot keep pace.
|
|
|
|
\textbf{Wall~8: The Ingestion Wall.} Storage I/O must supply training samples at the rate the accelerator consumes them. The \texttt{DataModel} computes the demand--supply ratio~\citep{mohan2021analyzing}:
|
|
\begin{equation}
|
|
\label{eq:ingestion}
|
|
\rho_{\text{data}} = \frac{BW_{\text{demand}}}{BW_{\text{supply}}}
|
|
\end{equation}
|
|
When $\rho_{\text{data}} > 1$, the accelerator stalls waiting for data. $BW_{\text{demand}}$ is the product of batch size, sample size, and step rate; $BW_{\text{supply}}$ is the effective throughput of the storage subsystem after accounting for read amplification and caching. \textbf{Assumption:} Storage bandwidth is the bottleneck, not network I/O (single-node training).
|
|
|
|
\textbf{Wall~9: The Transformation Wall.} JPEG decoding, tokenization, and augmentation execute on CPU cores, not on the accelerator. When the CPU preprocessing pipeline cannot keep pace, the accelerator stalls even if storage bandwidth is abundant. The \texttt{TransformationModel} quantifies this bottleneck~\citep{murray2021tf}:
|
|
\begin{equation}
|
|
\label{eq:transform}
|
|
T_{\text{transform}} = \frac{B}{R_{\text{cpu}}}
|
|
\end{equation}
|
|
where $B$ is the batch size (in samples) and $R_{\text{cpu}}$ is the aggregate CPU preprocessing rate (in samples/s), computed as $R_{\text{cpu}} = N_{\text{workers}} \times r_{\text{worker}}$ where $r_{\text{worker}}$ is the per-worker throughput after all transformations (decode, augment, normalize). \textbf{Assumption:} Preprocessing is CPU-bound and scales linearly with worker count up to core saturation.
|
|
|
|
\textbf{Wall~10: The Locality Wall.} Network topology determines the effective bandwidth available between any two nodes in the cluster. The \texttt{TopologyModel} models this through the \emph{bisection bandwidth fraction} $\beta$~\citep{leiserson1985fat}, which varies by topology: Fat-Tree provides full bisection ($\beta = 1.0$), Dragonfly achieves $\beta \approx 0.85$, and 3D~Torus yields $\beta \approx 0.67$. The effective inter-node bandwidth is:
|
|
\begin{equation}
|
|
\label{eq:locality}
|
|
BW_{\text{eff}} = \frac{BW_{\text{link}} \times \beta}{\text{oversubscription}}
|
|
\end{equation}
|
|
This wall becomes binding when collective communication patterns demand bandwidth that the topology cannot supply at scale. We place the Locality Wall in the Data domain rather than Fleet because it models the \emph{physical topology constraint} on data movement (bisection bandwidth, oversubscription), whereas the Communication Wall (Wall~14, Fleet) models the \emph{algorithmic cost} of specific collectives (AllReduce, All-to-All) that run atop that topology. The two walls interact but address distinct levels of abstraction. \textbf{Assumption:} $\beta$ values are topology-specific constants; real networks may exhibit dynamic congestion not captured by this static model.
|
|
|
|
\subsection{Algorithm (Scaling \& Compression)}
|
|
\label{sec:walls-algorithm}
|
|
|
|
The Algorithm walls arise not from hardware but from the mathematics of learning itself. They determine how much computation a workload \emph{requires}, independent of the silicon that executes it.
|
|
|
|
\textbf{Wall~11: The Complexity Wall.} Chinchilla scaling laws~\citep{hoffmann2022chinchilla} establish that training compute scales jointly with model size $P$ and dataset size $D$: doubling the parameters requires approximately doubling the tokens to remain compute-optimal. The \texttt{ScalingModel} implements:
|
|
\begin{align}
|
|
\label{eq:chinchilla}
|
|
C &= 6PD \quad \text{(total training FLOPs)} \\
|
|
D^{*} &\approx 20P \quad \text{(compute-optimal tokens)} \\
|
|
P^{*} &= \sqrt{\frac{C}{120}} \quad \text{(optimal model size for budget } C\text{)}
|
|
\end{align}
|
|
These relations allow students to reason backward from a compute budget to the largest model that can be trained optimally, or forward from a model size to the minimum viable training cluster. \textbf{Assumption:} Scaling law coefficients are fitted to published training runs; extrapolation beyond the fitted range is flagged.
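Both directions of that reasoning are two lines of arithmetic; \Cref{lst:chinchilla-sketch} sketches them with the $C = 10^{24}$ FLOP budget used in Anchor~5 (\Cref{sec:validation}).

\begin{lstlisting}[caption={\textbf{Chinchilla Sketch.} Compute-optimal model/data split.},label={lst:chinchilla-sketch}]
import math

def chinchilla_optimal(C):
    """Optimal (P*, D*) for budget C, from C = 6PD and D = 20P."""
    P_star = math.sqrt(C / 120)
    return P_star, 20 * P_star

P_star, D_star = chinchilla_optimal(1e24)
print(f"P* ~ {P_star/1e9:.0f}B params, D* ~ {D_star/1e12:.1f}T tokens")
# -> ~91B parameters and ~1.8T tokens, matching Anchor 5
\end{lstlisting}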
|
|
|
|
\textbf{Wall~12: The Reasoning Wall.} Inference-time compute scaling introduces a cost that grows linearly with the number of reasoning steps $K$. The \texttt{InferenceScalingModel} models this as~\citep{snell2024scaling}:
|
|
\begin{equation}
|
|
\label{eq:reasoning}
|
|
T_{\text{reason}} = K \times T_{\text{step}}(P, S_{\text{context}})
|
|
\end{equation}
|
|
where $T_{\text{step}}$ is the per-step latency, itself a function of model size $P$ and context length $S_{\text{context}}$. Chain-of-thought and tree-search strategies can increase $K$ by $10$--$100\times$ relative to single-pass inference~\citep{snell2024scaling}, fundamentally altering serving cost economics. \textbf{Assumption:} Each reasoning step is an independent decode sequence; KV-cache is not shared across steps.
|
|
|
|
\textbf{Wall~13: The Fidelity Wall.} Compression trades model fidelity for efficiency: quantization reduces precision while pruning removes weights entirely. The accuracy--efficiency frontier is task- and architecture-dependent. The \texttt{CompressionModel} quantifies the two primary mechanisms~\citep{han2016deep,gholami2021survey}:
|
|
\begin{align}
|
|
\label{eq:compression}
|
|
r_{\text{quant}} &= \frac{b_{\text{base}}}{b_{\text{target}}} \quad \text{(quantization ratio, e.g., } 16/4 = 4{\times}\text{)} \\
|
|
r_{\text{prune}} &= \frac{1}{1 - s} \quad \text{(memory reduction at sparsity } s\text{)}
|
|
\end{align}
|
|
Here $b_{\text{base}}$ is the baseline precision (typically 16 for FP16/BF16 models, not 32) and $b_{\text{target}}$ is the quantized precision. Critically, quantization reduces memory reads but not FLOPs, shifting the arithmetic intensity rightward on the Roofline by a factor of $r_{\text{quant}}$ and potentially crossing the ridge point from memory-bound to compute-bound. For pruning, $r_{\text{prune}}$ gives the \emph{memory reduction} ratio; actual compute speedup depends on sparsity structure. Unstructured sparsity yields no acceleration on current GPUs, while 2:4 structured sparsity~\citep{nvidia2023h100} provides a $2{\times}$ throughput gain via Sparse Tensor Cores. Post-training quantization methods such as GPTQ~\citep{frantar2023gptq} and AWQ~\citep{lin2024awq} demonstrate that 4-bit quantization can preserve most accuracy for large language models, making $r_{\text{quant}} = 4{\times}$ a practical operating point. The accuracy impact $\Delta_{\text{acc}}$ is modeled as a configurable function, since the fidelity--compression frontier varies by architecture and task. \textbf{Assumptions:} Accuracy degradation follows empirical curves from~\citet{gholami2021survey}; pruning compute speedup requires structured sparsity with hardware support.
|
|
|
|
\subsection{Fleet (Multi-Node Coordination)}
|
|
\label{sec:walls-fleet}
|
|
|
|
The Fleet walls arise when systems scale beyond a single node, requiring coordination across multiple accelerators connected by network fabric.
|
|
|
|
\textbf{Wall~14: The Communication Wall.} Distributed training requires gradient synchronization across $N$ nodes, and the cost of that synchronization grows with both message size and node count. The \texttt{DistributedModel} models the dominant collective operations. For Ring AllReduce~\citep{shoeybi2019megatron} and its \textbf{ZeRO/FSDP} partitioned equivalents (Reduce-Scatter and All-Gather):
|
|
\begin{equation}
|
|
\label{eq:allreduce}
|
|
T_{\text{ring}} = \frac{2(N-1)}{N} \cdot \frac{M}{B_{\text{link}}} + 2(N-1) \cdot \alpha
|
|
\end{equation}
|
|
where $M$ is the message size, $B_{\text{link}}$ is the per-link bandwidth, and $\alpha$ is the per-hop latency. Modern clusters use \textbf{hierarchical AllReduce} to exploit the bandwidth asymmetry between intra-node interconnect (e.g., NVLink at 900\,GB/s) and inter-node fabric (e.g., InfiniBand at 50\,GB/s per port, an $18{\times}$ gap). The \texttt{DistributedModel} implements a two-level model:
|
|
\begin{equation}
|
|
\label{eq:hierarchical}
|
|
T_{\text{hier}} = T_{\text{intra}}(BW_{\text{NVLink}},\, G) + T_{\text{inter}}(BW_{\text{IB}},\, N)
|
|
\end{equation}
|
|
where $G$ is GPUs per node and $N$ is the node count. Each level applies the ring formula (\Cref{eq:allreduce}) at its respective bandwidth, making the inter-node phase the dominant cost at scale. In practice, modern frameworks also utilize \textbf{Compute/Communication Overlap}, hiding network latency behind backward pass computation. The \texttt{DistributedModel} models this with an overlap efficiency parameter $\eta_{\text{overlap}} \in [0,1]$ (default 0.85, reflecting typical Megatron-LM behavior), yielding an exposed communication cost of $(1 - \eta_{\text{overlap}}) \cdot T_{\text{comm}}$. Pipeline parallelism~\citep{narayanan2021efficient} introduces a bubble overhead:
|
|
\begin{equation}
|
|
\label{eq:bubble}
|
|
B_{\text{pipeline}} = \frac{P_{\text{stages}} - 1}{V \cdot M_{\text{micro}} + P_{\text{stages}} - 1}
|
|
\end{equation}
|
|
where $V$ is the number of virtual pipeline stages and $M_{\text{micro}}$ is the number of microbatches. For Mixture-of-Experts All-to-All dispatch:
|
|
\begin{equation}
|
|
\label{eq:alltoall}
|
|
T_{\text{a2a}} = \frac{(N-1)}{N} \cdot \frac{M}{B_{\text{link}}} + (N-1) \cdot \alpha
|
|
\end{equation}
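These collective costs are plain arithmetic. \Cref{lst:ring-sketch} evaluates \Cref{eq:allreduce} and \Cref{eq:bubble} as a standalone sketch, using the gradient-shard and bandwidth figures from Case Study~R3; the per-hop latency $\alpha = 5\,\mu\text{s}$ is an assumed value and contributes under 1\,ms here.

\begin{lstlisting}[caption={\textbf{Communication Wall Sketch.} Ring AllReduce cost and pipeline bubble.},label={lst:ring-sketch}]
def ring_allreduce_s(M_bytes, N, bw_link, alpha=5e-6):
    """Ring AllReduce time: bandwidth term plus per-hop latency term."""
    return 2 * (N - 1) / N * (M_bytes / bw_link) + 2 * (N - 1) * alpha

def pipeline_bubble(P_stages, V, M_micro):
    """Idle fraction of a V-way interleaved pipeline with M microbatches."""
    return (P_stages - 1) / (V * M_micro + P_stages - 1)

# 17.5 GB gradient shard across 64 nodes at 50 GB/s (Case Study R3)
t = ring_allreduce_s(17.5e9, N=64, bw_link=50e9)
print(f"AllReduce ~ {t*1e3:.0f} ms; exposed at 85% overlap ~ {0.15*t*1e3:.0f} ms")
print(f"Bubble (PP=4, V=1, M=4): {pipeline_bubble(4, 1, 4):.0%}")   # 3/7 ~ 43%
\end{lstlisting}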
|
|
|
|
\textbf{Wall~15: The Fragility Wall.} Component failures are inevitable at scale. If each node has a mean time between failures of $\text{MTBF}_{\text{node}}$, then a cluster of $N$ nodes has~\citep{daly2006higher}:
|
|
\begin{equation}
|
|
\label{eq:mtbf}
|
|
\text{MTBF}_{\text{cluster}} = \frac{\text{MTBF}_{\text{node}}}{N}
|
|
\end{equation}
|
|
The probability of at least one failure during a training run of duration $T$ is:
|
|
\begin{equation}
|
|
\label{eq:pfail}
|
|
P(\geq 1 \text{ failure}) = 1 - e^{-T / \text{MTBF}_{\text{cluster}}}
|
|
\end{equation}
|
|
The Young-Daly formula~\citep{young1974first,daly2006higher} gives the optimal checkpoint interval:
|
|
\begin{equation}
|
|
\label{eq:youngdaly}
|
|
\tau_{\text{opt}} = \sqrt{2 \delta \cdot \text{MTBF}_{\text{cluster}}}
|
|
\end{equation}
|
|
where $\delta$ is the time to write one checkpoint. The \texttt{ReliabilityModel} uses these relations to estimate the fraction of compute lost to checkpointing and recovery. \textbf{Assumption:} Failures are independent and exponentially distributed (memoryless).
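\Cref{lst:youngdaly-sketch} sketches \Cref{eq:mtbf} and \Cref{eq:youngdaly} end to end; the 60\,s checkpoint write time is an assumption, while the 512-node, 10{,}000-hour-MTBF fleet mirrors Case Study~R3 in \Cref{sec:usage}.

\begin{lstlisting}[caption={\textbf{Young-Daly Sketch.} Optimal checkpoint interval at cluster scale.},label={lst:youngdaly-sketch}]
import math

def young_daly_interval_s(delta_s, mtbf_node_s, n_nodes):
    """Optimal checkpoint interval: sqrt(2 * delta * MTBF_cluster)."""
    mtbf_cluster = mtbf_node_s / n_nodes
    return math.sqrt(2 * delta_s * mtbf_cluster)

# Assumed 60 s writes; 10,000 h per-node MTBF over 512 nodes (Case Study R3)
tau = young_daly_interval_s(60, 10_000 * 3600, 512)
print(f"Checkpoint every ~{tau/3600:.1f} h")   # ~0.8 h, i.e., roughly hourly
\end{lstlisting}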
|
|
|
|
\textbf{Wall~16: The Multi-tenant Wall.} Shared clusters introduce queueing delays that grow hyperbolically as utilization approaches 1.0. The \texttt{OrchestrationModel} models job wait times using an M/D/1 queue~\citep{little1961proof}:
|
|
\begin{equation}
|
|
\label{eq:queue}
|
|
T_{\text{wait}} = \frac{\rho}{2\mu(1 - \rho)}
|
|
\end{equation}
|
|
where $\rho = \lambda / \mu$ is the cluster utilization, $\lambda$ is the job arrival rate, and $\mu$ is the service rate. As $\rho \to 1$, wait times diverge hyperbolically, a non-linear relationship that makes the distinction between 80\% and 95\% utilization qualitatively significant. \textbf{Assumption:} Job durations are approximately deterministic, which is reasonable for large training runs with predictable step times.
|
|
|
|
\subsection{Operations (Cost, Carbon \& Safety)}
|
|
\label{sec:walls-operations}
|
|
|
|
The Operations walls capture constraints that are not about \emph{how fast} a system runs but \emph{whether it should run at all}: economic viability, environmental impact, checkpoint overhead, and responsible deployment.
|
|
|
|
\textbf{Wall~17: The Capital Wall.} Performance analysis is incomplete without economic constraints. The \texttt{EconomicsModel} computes total cost of ownership~\citep{barroso2018datacenter}:
|
|
\begin{equation}
|
|
\label{eq:tco}
|
|
\text{TCO} = \text{CapEx} + \text{OpEx}_{\text{energy}} + \text{OpEx}_{\text{maint}}
|
|
\end{equation}
|
|
where $\text{OpEx}_{\text{energy}} = E_{\text{total}} \times P_{\text{kWh}}$ converts total energy consumption to dollar cost at the regional electricity price. \textbf{Assumption:} Linear amortization over a 3--5 year hardware lifetime.
|
|
|
|
\textbf{Wall~18: The Sustainability Wall.} The same training run can emit up to 40$\times$ more CO$_2$ depending on regional grid carbon intensity (e.g., Iowa circa 2020 at 680\,gCO$_2$/kWh vs.\ Qu\'ebec at 17\,gCO$_2$/kWh). The \texttt{SustainabilityModel} converts energy into environmental impact~\citep{patterson2021carbon}:
|
|
\begin{align}
|
|
\label{eq:sustainability}
|
|
E_{\text{total}} &= E_{\text{IT}} \times \text{PUE} \\
|
|
\text{CO}_2 &= E_{\text{total}} \times \text{CI}_{\text{region}} \quad \text{(gCO}_2\text{/kWh)} \\
|
|
\text{H}_2\text{O} &= E_{\text{total}} \times \text{WUE} \quad \text{(L/kWh)}
|
|
\end{align}
|
|
where PUE is the power usage effectiveness of the datacenter, $\text{CI}_{\text{region}}$ is the carbon intensity of the local grid, and WUE is the water usage effectiveness. \textbf{Assumption:} Grid carbon intensity is a static regional constant; temporal variation (e.g., renewable intermittency) is not modeled. Energy-proportional power follows~\citet{barroso2007case}: idle power is 30\% of TDP, with the remaining 70\% scaling linearly with MFU.
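Because \Cref{eq:sustainability} is a chain of multiplications, the geographic contrast is easy to reproduce. \Cref{lst:carbon-sketch} assumes a 550\,MWh IT load (chosen to land near Case~I2 in \Cref{sec:usage}) and varies only the grid constant; the water term follows analogously from WUE.

\begin{lstlisting}[caption={\textbf{Sustainability Sketch.} Same energy, two grids.},label={lst:carbon-sketch}]
def carbon_tonnes(e_it_mwh, pue, ci_g_per_kwh):
    """IT energy -> facility energy (via PUE) -> CO2 in tonnes."""
    e_total_kwh = e_it_mwh * 1e3 * pue
    return e_total_kwh * ci_g_per_kwh / 1e6

# Assumed 550 MWh IT draw; only the grid carbon intensity differs
for region, ci in (("Iowa ~2020", 680), ("Quebec", 17)):
    print(f"{region}: {carbon_tonnes(550, 1.1, ci):.0f} t CO2")
\end{lstlisting}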
|
|
|
|
\textbf{Wall~19: The Checkpoint Wall.} Long-running training jobs must periodically save model state (weights and optimizer states) to persistent storage, incurring an I/O penalty that directly reduces effective MFU. The \texttt{CheckpointModel} models the I/O burst penalty~\citep{eisenman2022checknrun}:
|
|
\begin{equation}
|
|
\label{eq:checkpoint2}
|
|
\text{MFU}_{\text{penalty}} = \frac{T_{\text{write}}}{T_{\text{interval}}} = \frac{|W| \times \beta_{\text{opt}} / BW_{\text{storage}}}{T_{\text{ckpt\_interval}}}
|
|
\end{equation}
|
|
where $|W|$ is the model weight size in bytes, and $\beta_{\text{opt}}$ is the optimizer state multiplier---the ratio of total checkpoint bytes to model weight bytes. For mixed-precision Adam with FP16 weights, $\beta_{\text{opt}} \approx 7$ (FP32 master weights at 4\,bytes + FP32 momentum at 4\,bytes + FP32 variance at 4\,bytes, plus FP16 model weights at 2\,bytes, totaling 14\,bytes per parameter vs.\ 2\,bytes for the FP16 model alone). Gradients are ephemeral and not checkpointed. For a 70B-parameter model, the checkpoint size is $70\text{B} \times 14\text{\,B/param} \approx 0.98$\,TB, making storage bandwidth the binding constraint during I/O bursts.
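\Cref{lst:ckpt-sketch} sketches \Cref{eq:checkpoint2} for this 70B example; the 25\,GB/s storage bandwidth is an assumed parallel-filesystem figure, not a measured value.

\begin{lstlisting}[caption={\textbf{Checkpoint Wall Sketch.} MFU penalty of periodic checkpoint writes.},label={lst:ckpt-sketch}]
def mfu_penalty(w_bytes, beta_opt, bw_storage, interval_s):
    """Write time per checkpoint and the fraction of wall-clock time it costs."""
    t_write = w_bytes * beta_opt / bw_storage
    return t_write, t_write / interval_s

# 70B FP16 model, Adam (beta_opt ~ 7), assumed 25 GB/s storage, hourly writes
t_write, penalty = mfu_penalty(140e9, beta_opt=7, bw_storage=25e9, interval_s=3600)
print(f"~{t_write:.0f} s per checkpoint -> {penalty:.1%} MFU penalty")
\end{lstlisting}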
|
|
|
|
\textbf{Wall~20: The Safety Wall.} Privacy and fairness guarantees impose quantifiable computational overhead. The \texttt{ResponsibleEngineeringModel} models the cost of differential privacy via DP-SGD~\citep{abadi2016deep}, where the noise multiplier scales inversely with the privacy budget:
|
|
\begin{equation}
|
|
\label{eq:dpsgd}
|
|
\sigma \propto \frac{1}{\varepsilon}
|
|
\end{equation}
|
|
The per-step clipping and noise addition incur a training slowdown of approximately $2$--$10\times$. Fairness constraints require sufficient representation of minority subgroups, demanding additional data proportional to $O(1/p_{\min})$ where $p_{\min}$ is the smallest subgroup prevalence. \textbf{Assumption:} Privacy budget $\varepsilon$ is a hard constraint; the resolver reports the compute multiplier, not the privacy guarantee.
|
|
|
|
\subsection{Analysis (Cross-Cutting Diagnostics)}
|
|
\label{sec:walls-analysis}
|
|
|
|
The preceding 20 walls each model a specific physical or logical constraint. The final two entries are \emph{diagnostic tools} rather than walls in the strict sense: they operate \emph{across} the taxonomy rather than within a single domain, providing analysis capabilities that span all walls. The Sensitivity tool identifies which wall is binding, and the Synthesis tool derives minimum hardware from SLA requirements.
|
|
|
|
\textbf{Wall~21: The Sensitivity Wall.} Optimization is effective only when directed at the binding constraint; improving a non-bottleneck parameter yields no measurable gain. The \texttt{SensitivitySolver} identifies the binding constraint by computing partial derivatives of end-to-end latency with respect to each hardware parameter~\citep{williams2009roofline}:
|
|
\begin{equation}
|
|
\label{eq:sensitivity}
|
|
\frac{\partial T}{\partial BW_{\text{mem}}}, \quad \frac{\partial T}{\partial \text{Peak}_{\text{FLOPS}}}, \quad \frac{\partial T}{\partial BW_{\text{net}}}, \quad \ldots
|
|
\end{equation}
|
|
The parameter with the largest sensitivity is the binding constraint, that is, the single upgrade that would yield the greatest performance improvement. This transforms ``where should I invest?'' from intuition into calculation. \textbf{Assumption:} Finite-difference approximation with 1\% perturbation; second-order effects are ignored. The binding constraint is identified as the parameter with the largest normalized gradient $|\partial T / \partial x_i| \cdot (x_i / T)$.
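The normalized-gradient rule fits in a few lines. \Cref{lst:sens-sketch} applies it to a toy Roofline latency function rather than the full \texttt{SensitivitySolver}, but reproduces the qualitative verdict of Case Study~R2: for memory-bound decode, bandwidth carries essentially all the sensitivity.

\begin{lstlisting}[caption={\textbf{Sensitivity Sketch.} Normalized finite-difference gradients.},label={lst:sens-sketch}]
def normalized_sensitivities(latency_fn, params, h=0.01):
    """Central-difference elasticities dT/dx * (x/T) at a 1% perturbation."""
    T0 = latency_fn(**params)
    out = {}
    for name, x in params.items():
        up = latency_fn(**dict(params, **{name: x * (1 + h)}))
        dn = latency_fn(**dict(params, **{name: x * (1 - h)}))
        out[name] = (up - dn) / (2 * h * x) * x / T0
    return out

# Toy decode latency for a 140 GB model: max(memory time, compute time)
lat = lambda bw, flops: max(140e9 / bw, 2 * 70e9 / flops)
print(normalized_sensitivities(lat, {"bw": 2.0e12, "flops": 312e12}))
# -> bw elasticity ~ -1.0, flops ~ 0.0: bandwidth is the binding constraint
\end{lstlisting}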
|
|
|
|
\textbf{Wall~22: The Synthesis Wall.} The \texttt{SynthesisSolver} addresses the inverse problem: given a service-level objective (e.g., 50\,ms inter-token latency), it derives the minimum hardware specifications required to satisfy it~\citep{kwon2023efficient}:
|
|
\begin{align}
|
|
\label{eq:synthesis}
|
|
BW_{\text{required}} &= \frac{|W|}{T_{\text{target}}} \\
|
|
\text{FLOPS}_{\text{required}} &= \frac{\text{OPs}}{T_{\text{target}} \times \eta}
|
|
\end{align}
|
|
This enables hardware-software co-design: engineers specify an SLA and the solver derives the minimum hardware that satisfies it. \textbf{Assumption:} Hardware parameters are independently adjustable; co-design coupling between FLOPS and bandwidth is not modeled.
|
|
|
|
\section{The 3-Tier Resolver Architecture}
|
|
\label{sec:solver-formalism}
|
|
|
|
The walls define \emph{what} constrains a system. This section formalizes \emph{how} resolvers compose to produce end-to-end system evaluations. To clarify the mathematical intent of each component, \mlsysim organizes its analytical tools into a strict 3-tier taxonomy: \textbf{Models} (evaluate), \textbf{Solvers} (diagnose), and \textbf{Optimizers} (search).
|
|
|
|
\subsection{Tier 1: Analytical Models}
|
|
Analytical models act as the ``physics engine.'' They perform forward evaluation ($Y = f(X)$) to determine the physical and logical consequences of a specific system configuration. For example, the \texttt{ServingModel} calculates the exact time-to-first-token for a given LLM and GPU pair. Models are purely deterministic and make no decisions; they comprise the first 20 resolvers in our taxonomy.
|
|
|
|
\subsection{Tier 2: Analysis Solvers}
|
|
Analysis solvers act as the ``math engine.'' They perform algebraic inversion or calculus ($X = f^{-1}(Y)$ or $\nabla f$) to find the exact parameter required to hit a specific target. For example, the \texttt{SynthesisSolver} takes a target latency SLA and works backward to derive the minimum memory bandwidth required.
|
|
|
|
\subsection{Tier 3: Optimizers}
|
|
Optimizers act as the ``engineering engine.'' They perform constrained design-space search ($\max f(X) \text{ s.t. } g(X) \le c$) to find the best configuration among many valid options. Unlike Models and Solvers, which map directly to individual walls, Optimizers operate across the entire taxonomy to navigate complex constraint spaces. For example, the \texttt{ParallelismOptimizer} sweeps all valid 3D tensor/pipeline/data parallel splits to maximize Model FLOPs Utilization (MFU) on a given cluster, while the \texttt{BatchingOptimizer} searches for the maximum batch size that satisfies a P99 queueing latency SLA.
|
|
|
|
\subsection{Stateless Composition and Chaining}
|
|
\label{sec:solvers-compose}
|
|
|
|
Every resolver in \mlsysim is a pure function: it accepts a typed configuration, performs analytical computation, and returns a typed result object (a Pydantic \texttt{BaseModel} with dimensioned fields). Resolvers maintain no hidden state between invocations. Because they share a common type system, the output of one tier feeds naturally into the next. A full-stack analysis composes resolvers in sequence to resolve complex design questions. For example, determining the financial cost of training an optimally-sized model on a frontier cluster requires chaining algorithmic scaling, distributed execution, and macro-economics:
|
|
|
|
\begin{equation}
|
|
\label{eq:chain}
|
|
\mathsf{Scaling} \xrightarrow{\;\mathcal{R}_1\;} \mathsf{Distributed} \xrightarrow{\;\mathcal{R}_2\;} \mathsf{Economics} \xrightarrow{\;\mathcal{R}_3\;} \mathsf{Sustainability}
|
|
\end{equation}
|
|
|
|
\Cref{lst:composability} demonstrates this exact chain in \mlsysim. The \texttt{ScalingModel} calculates the optimal model size for a given compute budget ($\mathcal{R}_1$). The \texttt{DistributedModel} takes that workload and computes the real-world execution time on an 8,192-GPU fleet, factoring in 3D parallelism overhead ($\mathcal{R}_2$). Finally, the \texttt{EconomicsModel} converts that execution time into a Total Cost of Ownership ($\mathcal{R}_3$).
|
|
|
|
\begin{lstlisting}[caption={\textbf{Resolver Composition.} Bridging algorithmic scaling, distributed execution, and fleet economics in a single executable chain.},label={lst:composability},float=t]
|
|
import mlsysim
|
|
from mlsysim import ScalingModel, DistributedModel, EconomicsModel
|
|
|
|
# 1. Algorithm: Find optimal parameters for a fixed compute budget
|
|
budget = mlsysim.Q_("4e24 FLOP") # ~100K H100-days at 50% MFU
|
|
optimal = ScalingModel().solve(compute_budget=budget)
|
|
|
|
# Instantiate the demand (Layer A: Workload)
|
|
model = mlsysim.TransformerWorkload(
|
|
name="Frontier-Model",
|
|
parameters=optimal.optimal_parameters,
|
|
layers=80, hidden_dim=8192, heads=64
|
|
)
|
|
|
|
# 2. Fleet: Evaluate on a massive 8K GPU cluster (Layer D: Supply/Topology)
|
|
fleet = mlsysim.Systems.Clusters.Frontier_8K
|
|
perf = DistributedModel().solve(
|
|
model, fleet,
|
|
batch_size=4096, tp_size=8, pp_size=4
|
|
)
|
|
|
|
# 3. The Capital: Calculate TCO for the resulting training time
|
|
duration_days = perf.step_latency_total * optimal.optimal_tokens / 4096  # step time x steps (tokens / batch)
|
|
tco = EconomicsModel().solve(fleet, duration_days=duration_days.to('day').magnitude)
|
|
|
|
print(f"Scaling Efficiency: {perf.scaling_efficiency:.1%}")
|
|
print(f"Total Job Cost: ${tco.tco_usd:,.2f}")
|
|
\end{lstlisting}
|
|
|
|
Each link in the chain preserves dimensional correctness: units propagate through the computation, and any mismatch raises an immediate error rather than producing a silently wrong result.
|
|
|
|
\begin{figure*}[!t]
|
|
\centering
|
|
\includegraphics[trim=0 50 0 0, clip, width=0.95\textwidth]{images/pdf/solver-chaining.pdf}
|
|
\caption{\textbf{Resolver Composition.} Three input layers feed four resolvers. Each resolver is a pure function: typed inputs in, dimensionally correct outputs out. The scorecard aggregates three evaluation levels: Feasibility, Performance, and Macro (economics, sustainability, safety).}
|
|
\label{fig:solver-chaining}
|
|
\end{figure*}
|
|
|
|
\subsection{The SystemEvaluation Scorecard}
|
|
\label{sec:solvers-scorecard}
|
|
|
|
\mlsysim provides a \texttt{Scenario.evaluate()} entry point that orchestrates resolver composition automatically through a three-level evaluation:
|
|
|
|
\textbf{Level~1: Feasibility.} Does the model fit? Can the data pipeline keep pace? The framework checks memory capacity against model size, ingestion bandwidth against training throughput, and reports any wall where demand exceeds supply.
|
|
|
|
\textbf{Level~2: Performance.} What are the achievable latency, throughput, and utilization? The Roofline analysis (\Cref{eq:bottleneck}), communication modeling (\Cref{eq:allreduce}), and pipeline bubble (\Cref{eq:bubble}) combine to produce end-to-end training step time.
|
|
|
|
\textbf{Level~3: Macro.} What does it cost, and what does it emit? TCO (\Cref{eq:tco}), carbon (\Cref{eq:sustainability}), and responsibility overhead (\Cref{eq:dpsgd}) are computed from the performance results.
|
|
|
|
The three levels are evaluated in order; a feasibility failure at Level~1 short-circuits the evaluation and reports the binding constraint. This ordering reflects the dependency structure: communication optimization is irrelevant if the model exceeds available memory. The complete implementation details and key assumptions for each resolver are documented alongside their respective walls in \Cref{sec:taxonomy}.
|
|
|
|
\section{Validation}
|
|
\label{sec:validation}
|
|
|
|
An analytical framework earns trust through transparent confrontation with empirical ground truth. We validate \mlsysim along two axes: \emph{accuracy} against published benchmarks, and \emph{speed} relative to alternative modeling tools.
|
|
|
|
\subsection{Empirical Anchors}
|
|
|
|
We anchor \mlsysim predictions against seven published benchmarks spanning single-node training, distributed training, inference, scaling laws, sustainability, and automated design-space optimization.
|
|
|
|
\textbf{Anchor~1: MLPerf ResNet-50 on DGX A100 (Single-Node Training).}
|
|
For ResNet-50 training on a DGX A100 node (8$\times$ A100 GPUs with NVLink) at batch size 2048, \mlsysim predicts a throughput of approximately 37{,}000 samples/s using the \texttt{SingleNodeModel} with hardware utilization $\eta = 0.49$ and 8-way data parallelism within the node. The MLPerf Training v4.0 NVIDIA closed-division submission reports 38{,}200 samples/s for this 8-GPU configuration~\citep{mlperf2020}, yielding a prediction error of 3.1\%. Per-GPU throughput is $\sim$4{,}750 samples/s, consistent with the A100's Roofline ceiling for ResNet-50's arithmetic intensity. This validates the Roofline-based throughput model~\citep{williams2009roofline} at the core of \mlsysim's single-node solver.
|
|
|
|
\textbf{Anchor~2: vLLM Llama-2-70B on H100 (Inference).}
|
|
For autoregressive decoding of Llama-2-70B (FP16, batch size 1), the model weights total 140\,GB, requiring at minimum two tensor-parallel H100s (each with 80\,GB HBM3). \mlsysim's first-order estimate divides total weights by aggregate bandwidth: $140\;\text{GB} / (2 \times 3.35\;\text{TB/s}) = 20.9$\,ms for the weight-read phase alone. Adding KV-cache reads, attention computation, NVLink synchronization, and framework scheduling overhead, the predicted end-to-end ITL is approximately 42\,ms~\citep{nvidia2023h100}. Published vLLM benchmarks for this configuration report ITL values in the 40--50\,ms range~\citep{kwon2023efficient}, confirming that decode-phase LLM inference is memory-bandwidth-bound and that the overhead multiplier ($\sim$2$\times$ over the pure bandwidth floor) is consistent across deployments.
|
|
|
|
\textbf{Anchor~3: Llama~3 Training at 16K H100s (Distributed Training).}
|
|
Meta's Llama~3 training report~\citep{llama3team2024} documents achieving 38--43\% MFU on 16{,}384 H100 GPUs with 4D parallelism (DP$\times$TP$\times$PP$\times$CP). We configure \mlsysim's \texttt{DistributedModel} with an equivalent fleet (2{,}048 nodes $\times$ 8 H100s, 400\,Gb/s InfiniBand per node, TP=8, PP=4, DP=512) training a 405B-parameter model. After accounting for pipeline bubble overhead (\Cref{eq:bubble}; with $V{=}1$ and $M{=}\text{PP}{=}4$, the bubble fraction is $3/7 \approx 43\%$, which Llama~3 mitigates via interleaved scheduling with $V{>}1$ and large $M$, reducing the effective bubble to ${\sim}10$--$15\%$), hierarchical AllReduce cost (\Cref{eq:hierarchical}), and compute--communication overlap ($\eta_{\text{overlap}} = 0.85$), \mlsysim predicts an aggregate MFU of 40.2\%, within the reported 38--43\% range. This validates the distributed training model at production scale.
|
|
|
|
\textbf{Anchor~4: PaLM Scaling Efficiency (Communication Overhead).}
|
|
Google's PaLM report~\citep{chowdhery2022palm} shows MFU declining from ${\sim}57\%$ on a single TPU~v4 pod (6{,}144 chips) to ${\sim}46\%$ at the full 64{,}000-chip scale due to communication overhead. Using \mlsysim's \texttt{DistributedModel} with TPU~v4 specifications (275\,TFLOP/s BF16, ICI bandwidth) and the PaLM-540B workload, the predicted MFU drops from 55\% (single pod) to 44\% (full scale), tracking the reported degradation within $\pm$3 percentage points. The key factor is the inter-pod communication cost: \mlsysim correctly identifies the intra-pod to inter-pod bandwidth transition as the dominant scaling bottleneck.
|
|
|
|
\textbf{Anchor~5: Chinchilla Scaling Laws (Algorithmic Scaling).}
|
|
The Chinchilla paper~\citep{hoffmann2022chinchilla} establishes that compute-optimal training requires $D \approx 20P$ tokens. \mlsysim's \texttt{ScalingModel} implements the parametric scaling law $C = 6PD$ and derives the optimal allocation $P^{*} = \sqrt{C/120}$. For $C = 10^{24}$ FLOPs, the resolver predicts $P^{*} \approx 91$B parameters with $D^{*} \approx 1.8$T tokens. Chinchilla (70B, 1.4T tokens) was trained at a slightly smaller compute budget ($C \approx 5 \times 10^{23}$), for which the resolver predicts $P^{*} \approx 65$B---within 7\% of the actual 70B. This validates the scaling law implementation against its original calibration data.
|
|
|
|
\textbf{Anchor~6: Training Carbon Footprint (Sustainability).}
|
|
\citet{patterson2021carbon} report that training GPT-3 (175B parameters) on V100 GPUs consumed approximately 1{,}287\,MWh and emitted 552 tonnes CO$_2$. We configure \mlsysim's \texttt{SustainabilityModel} with the reported parameters (10{,}000 V100 GPUs, 34 days, PUE of 1.1, US average grid at 429~gCO$_2$/kWh). The resolver estimates 1{,}198\,MWh energy consumption and $1{,}198 \times 429 / 1{,}000 = 514$ tonnes CO$_2$, a 7\% energy underestimate and 7\% carbon underestimate. Both discrepancies are consistent with our omission of host CPU, networking, and storage power draw, which contribute to the remaining $\sim$90\,MWh gap.
|
|
|
|
\textbf{Anchor~7: Llama~3 Parallelism Strategy (Optimizer Convergence).}
|
|
To validate the Tier 3 design-space search, we configure the \texttt{ParallelismOptimizer} with the Meta Llama~3 405B model and its 16{,}384 H100 cluster constraints~\citep{llama3team2024}. When asked to find the parallelism split that maximizes MFU under the 80\,GB HBM capacity constraint, the optimizer automatically converges on $\text{TP}{=}8$, $\text{PP}{=}4$, $\text{DP}{=}512$. This is the exact strategy published by Meta, confirming that the optimizer identifies the global maximum within the interacting constraints of memory ceilings and network topology.
|
|
|
|
These seven anchors span five of the six taxonomy domains---Node, Data (validated indirectly via the ResNet pipeline-bound case in \Cref{sec:usage}), Algorithm, Fleet, and Operations---and cover both Roofline regimes (compute-bound and memory-bound). \Cref{tab:validation} summarizes the results. Every hardware entry in the Silicon Zoo includes \texttt{metadata.source\_url} and \texttt{metadata.last\_verified} fields, ensuring traceability to the vendor datasheets from which constants are sourced.
|
|
|
|
\begin{table}[!t]
|
|
\centering
|
|
\caption{\textbf{Validation Summary.} Predicted vs.\ reported values across seven empirical anchors. Error is $|(\text{pred.} - \text{rep.}) / \text{rep.}|$.}
|
|
\label{tab:validation}
|
|
\small
|
|
\resizebox{\columnwidth}{!}{%
|
|
\renewcommand{\arraystretch}{1.15}
|
|
\begin{tabular}{@{}l l l r@{}}
|
|
\toprule
|
|
\textbf{Anchor} & \textbf{Predicted} & \textbf{Reported} & \textbf{Error} \\
|
|
\midrule
|
|
1: ResNet-50 DGX A100 & 37{,}000 samples/s & 38{,}200 samples/s & 3.1\% \\
|
|
2: Llama-2 70B ITL & 42\,ms & 40--50\,ms & in range \\
|
|
3: Llama~3 MFU & 40.2\% & 38--43\% & in range \\
|
|
4: PaLM scaling & 44\% MFU & $\sim$46\% MFU & 4.3\% \\
|
|
5: Chinchilla $P^*$ & 65B & 70B & 7.1\% \\
|
|
6: GPT-3 CO$_2$ & 514\,t & 552\,t & 6.9\% \\
|
|
7: Llama~3 Parallelism & TP=8, PP=4, DP=512 & TP=8, PP=4, DP=512 & 0.0\% \\
|
|
\bottomrule
|
|
\end{tabular}%
|
|
}
|
|
\end{table}
|
|
|
|
\subsection{Design-Space Exploration Speed}
|
|
|
|
\mlsysim's analytical engine sweeps over 1{,}000 hardware--model--precision configurations in under one second on a standard laptop. In contrast, ASTRA-sim~2.0 requires hours to simulate a single distributed training configuration at cycle-level fidelity~\citep{won2023astrasim2}. This three-order-of-magnitude speedup is the central design objective: sub-second execution enables interactive parametric sweeps (e.g., varying HBM bandwidth, substituting fat-tree for torus topology, or relocating the datacenter from Iowa to Singapore) that would be impractical with cycle-accurate simulation. \Cref{fig:heatmap} illustrates one such sweep: a 42-point grid of model size versus HBM bandwidth, where each cell represents a single resolver invocation and the entire map executes in under 50\,ms.
|
|
|
|
\begin{figure}[!t]
|
|
\centering
|
|
\includegraphics[width=0.9\columnwidth]{images/pdf/design-space-heatmap.pdf}
|
|
\caption{\textbf{Design-Space Exploration: Bottleneck Regime Map.} Each cell shows the binding constraint (memory or compute) for a given model size and HBM bandwidth combination under FP16 training at batch size 256. Larger models with lower bandwidth are memory-bound; smaller models with higher bandwidth are compute-bound. The diagonal regime boundary shifts as hardware generations increase bandwidth (A100 $\to$ H100 $\to$ B200). The entire 42-point grid executes in $<$50\,ms.}
|
|
\label{fig:heatmap}
|
|
\end{figure}
|
|
|
|
\subsection{Accuracy Scope and Limitations}
|
|
|
|
\mlsysim provides first-order estimates, not cycle-accurate predictions. The MFU parameter in each resolver absorbs second-order effects (cache behavior, warp scheduling, OS overhead) into a single empirically calibrated efficiency coefficient~$\eta$, following the Roofline model's use of achievable bandwidth rather than peak bandwidth~\citep{williams2009roofline}. \Cref{sec:discussion} details the specific phenomena this abstraction cannot capture and the resulting accuracy boundaries.
|
|
|
|
For \mlsysim's intended use cases (architectural reasoning, lab exercises, capacity planning, and design-space exploration), first-order accuracy is sufficient and often preferable. A student who understands \emph{why} a system is memory-bound has learned more than one who can predict its throughput to three decimal places. To safeguard against model drift, the test suite includes empirical anchor tests that fail automatically if predictions deviate beyond $\pm$10\% of published values.
|
|
|
|
\subsection{Dimensional Correctness as Validation}
|
|
|
|
Beyond numerical accuracy, \mlsysim enforces a structural form of validation through dimensional analysis. As described in \Cref{sec:dimensional}, every physical quantity carries \texttt{pint} units at runtime, so FLOP/s cannot be added to GB/s and latency cannot be compared to bandwidth without explicit conversion. This eliminates the category of silent unit-conversion errors that plague ad-hoc spreadsheet models and cannot be caught by numerical validation alone.
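\Cref{lst:pint-sketch} shows the mechanism with ordinary \texttt{pint} usage; the quantities are illustrative.

\begin{lstlisting}[caption={\textbf{Dimensional Safety.} Unit errors surface at runtime.},label={lst:pint-sketch}]
import pint

ureg = pint.UnitRegistry()
weights = 140 * ureg.gigabyte
bandwidth = 3.35 * ureg.terabyte / ureg.second

# bytes / (bytes/s) -> time; the conversion is explicit and checked
print((weights / bandwidth).to("millisecond"))   # ~41.8 ms

try:
    _ = weights + bandwidth   # bytes + bytes/s is dimensionally invalid
except pint.DimensionalityError as err:
    print(f"caught: {err}")
\end{lstlisting}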
|
|
|
|
\section{Usage \& Case Studies}
|
|
\label{sec:usage}
|
|
|
|
\mlsysim is designed for three audiences: students developing quantitative reasoning skills, instructors preparing demonstrations, and researchers evaluating design trade-offs. We present representative use cases for each persona, each illustrating how resolvers compose to answer questions that span multiple walls.
|
|
|
|
\subsection{Student Use Cases}
|
|
|
|
Students interact with \mlsysim primarily through single-solver queries, short resolver chains, and interactive WebAssembly-powered web applications. By embedding Marimo notebooks directly into the companion textbook, students can manipulate hardware parameters (e.g., batch size, SLA targets, carbon taxes) via UI sliders and instantly observe how binding constraints shift in real time without needing a backend server or physical hardware.
|
|
|
|
We present two examples chosen to illustrate complementary aspects of the framework. The first (S1) demonstrates \emph{vertical} resolver composition, chaining inference and economics models to connect an algorithmic decision (chain-of-thought reasoning) to its infrastructure cost. The second (S2) demonstrates \emph{horizontal} composition, combining data-pipeline and compute models to diagnose a bottleneck that shifts between walls as batch size increases.
|
|
|
|
\subsubsection{S1: Chain-of-Thought Cost}
|
|
A student investigates the inference economics of chain-of-thought (CoT) prompting. Using a GPT-4-scale Transformer (1.8T parameters) on a fleet of H100 nodes, they configure the \texttt{InferenceScalingModel} with $K{=}8$ reasoning steps. Each step generates $\sim$128 tokens at the memory-bound decode rate, so the total reasoning time is $T_{\text{reason}} = \text{TTFT} + K \cdot 128 \cdot \text{ITL}$. The solver reports that $K{=}8$ CoT multiplies per-query latency by $7.6{\times}$ relative to a single-step answer. Feeding this into the \texttt{EconomicsModel}, the student finds that at 100~QPS the annualized serving cost rises from \$1.2M to \$9.1M. This result quantifies CoT as a direct multiplier on infrastructure cost, not merely on latency.
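\Cref{lst:cot-sketch} sketches the arithmetic behind this case; the TTFT and ITL constants are assumptions chosen to land near the reported $7.6{\times}$ multiplier, not \mlsysim outputs.

\begin{lstlisting}[caption={\textbf{CoT Cost Sketch.} Reasoning steps as a latency and cost multiplier.},label={lst:cot-sketch}]
# Assumed per-query constants: 250 ms prefill, 30 ms/token decode
ttft_s, itl_s = 0.25, 0.030
K, tok_per_step = 8, 128

t_single = ttft_s + tok_per_step * itl_s       # one-step answer
t_cot = ttft_s + K * tok_per_step * itl_s      # K reasoning steps
mult = t_cot / t_single
print(f"Latency multiplier: {mult:.1f}x")      # ~7.6x

base_cost_musd = 1.2   # annualized single-step serving cost at 100 QPS
print(f"CoT serving cost: ${base_cost_musd * mult:.1f}M/yr")   # ~$9.1M
\end{lstlisting}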
|
|
|
|
\subsubsection{S2: CPU Pipeline Bottleneck}
|
|
A student configures ResNet-50 training on a DGX A100 (8 GPUs) with a batch size of 2{,}048. The \texttt{SingleNodeModel} predicts a per-step compute time of 48\,ms, yielding a demand rate of $2{,}048 / 0.048 \approx 42{,}700$ images/s. Adding the \texttt{DataModel} reveals the ingestion wall: with 64 CPU workers (8 per GPU) decoding ImageNet JPEGs at 1{,}200 images/s each, the raw decode pipeline delivers 76{,}800 images/s. This appears sufficient, but the \texttt{TransformationModel} accounts for the full augmentation pipeline (random crop, color jitter, normalization) at 850 images/s per worker, reducing effective throughput to $64 \times 850 = 54{,}400$ images/s. At batch 2{,}048, the headroom is slim: $54{,}400 / 42{,}700 = 1.27{\times}$. Doubling the batch to 4{,}096 doubles demand to 85{,}400 images/s, exceeding the CPU pipeline's capacity and triggering GPU stalls. The student discovers that the binding constraint shifts from Wall~1 (Compute) to Wall~9 (Transformation) as batch size increases---the GPU has spare cycles, but the CPUs cannot feed it fast enough.
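The demand--supply comparison behind this diagnosis takes a few lines, sketched in \Cref{lst:pipeline-sketch} with the case's own rates.

\begin{lstlisting}[caption={\textbf{Pipeline Bottleneck Sketch.} GPU demand vs.\ CPU supply.},label={lst:pipeline-sketch}]
# GPU-side demand at batch 2,048 with a 48 ms step (Case S2)
batch, step_s = 2048, 0.048
demand = batch / step_s                  # ~42,700 images/s

# CPU-side supply: 64 workers after the full augmentation pipeline
workers, full_rate = 64, 850
supply = workers * full_rate             # 54,400 images/s
print(f"demand={demand:,.0f}/s supply={supply:,.0f}/s "
      f"headroom={supply/demand:.3f}x")
# Doubling the batch pushes demand past supply: Wall 9 becomes binding
\end{lstlisting}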
|
|
|
|
\subsection{Instructor Use Cases}
|
|
|
|
Instructors need demonstrations that run in real time, produce concrete numbers, and connect cleanly to lecture narratives. The following cases show how \mlsysim turns abstract concepts into live, interactive classroom demonstrations.
|
|
|
|
\subsubsection{I1: Live Roofline Demo, Batch Size Sweep}
|
|
An instructor demonstrates the Roofline model by sweeping batch size from 1 to 256 on an H100 for a 7B-parameter Transformer. At each batch size, the \texttt{SingleNodeModel} returns both the bottleneck label and the MFU:
|
|
|
|
\begin{center}
|
|
\small
|
|
\begin{tabularx}{\columnwidth}{@{}rXlr@{}}
|
|
\toprule
|
|
Batch & Bottleneck & AI (FLOP/B) & MFU \\
|
|
\midrule
|
|
1 & Memory & 1.1 & 0.02 \\
|
|
8 & Memory & 8.6 & 0.14 \\
|
|
32 & Memory & 34.3 & 0.42 \\
|
|
128 & Compute & 137 & 0.61 \\
|
|
256 & Compute & 274 & 0.63 \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\end{center}
|
|
|
|
The crossover from memory-bound to compute-bound occurs near batch 64, where the arithmetic intensity $\text{AI} = 2B \cdot P / |W|$ crosses the effective ridge point ($\eta \times F_{\text{peak}} / \text{BW}_{\text{HBM}}$). At the achieved MFU ($\eta \approx 0.63$), the effective ridge is $\approx$186\,FLOP/byte, and the workload's AI at batch 64 is $\approx$69\,FLOP/byte---just entering the transition region where further batching begins yielding diminishing returns. \Cref{fig:roofline} visualizes this transition on the Roofline diagram. The entire sweep executes in $<$50\,ms, enabling real-time interaction during lecture.
|
|
|
|
\begin{figure}[!t]
|
|
\centering
|
|
\includegraphics[width=0.9\columnwidth]{images/pdf/roofline-crossover.pdf}
|
|
\caption{\textbf{Roofline Crossover: Batch Size Sweep on H100.} Increasing batch size moves the operating point rightward along the Roofline, transitioning from memory-bound (red) to compute-bound (blue). The peak ridge point is $989/3.35 \approx 295$\,FLOP/byte; at achieved MFU the effective crossover occurs at lower arithmetic intensity.}
|
|
\label{fig:roofline}
|
|
\end{figure}
|
|
|
|
\subsubsection{I2: Iowa vs.\ Qu\'ebec Carbon}
|
|
An instructor poses a policy question: \emph{does geography matter for carbon footprint?} Using the \texttt{DistributedModel}, they configure a 256-GPU cluster training a 70B model for 30 days. The \texttt{SustainabilityModel} then computes emissions under two grid profiles: Iowa (680~gCO$_2$/kWh, circa 2020 coal/gas grid) and Qu\'ebec (17~gCO$_2$/kWh, hydroelectric). The configuration is identical in hardware, model, and MFU, yet carbon footprint differs by $40{\times}$ (412 vs.\ 10.3 tonnes CO$_2$; 231 vs.\ 5.8\,kL water), demonstrating that grid carbon intensity is a first-order systems design variable (Wall~18). \Cref{fig:carbon} visualizes this contrast.
|
|
|
|
\begin{figure}[!t]
|
|
\centering
|
|
\includegraphics[width=0.9\columnwidth]{images/pdf/carbon-comparison.pdf}
|
|
\caption{\textbf{Geography as a Systems Variable: Iowa vs.\ Qu\'ebec.} An identical 256-GPU cluster training a 70B model for 30 days produces $40{\times}$ more CO$_2$ in Iowa (coal/gas grid circa 2020, 680~gCO$_2$/kWh) than in Qu\'ebec (95\% hydroelectric, 17~gCO$_2$/kWh). Water usage follows the same ratio. Hardware, model, parallelism strategy, and MFU are identical; only the regional grid carbon intensity differs between the two sites.}
|
|
\label{fig:carbon}
|
|
\end{figure}
|
|
|
|
\subsection{Researcher Use Cases}
|
|
|
|
Researchers need to evaluate architectural alternatives and justify procurement decisions with quantitative evidence. The following cases show how \mlsysim enables rapid what-if analysis across hardware generations and parallelism strategies.
|
|
|
|
\subsubsection{R1: GPU vs.\ Cerebras Crossover}
|
|
A researcher evaluates whether wafer-scale silicon changes the economics of large-model inference. For a 180B-parameter model on H100s (the 360\,GB FP16 checkpoint requires multi-GPU tensor parallelism across 5 GPUs), \mlsysim reports a decode-phase ITL of $360\;\text{GB} / (5 \times 3.35\;\text{TB/s}) = 21.5$\,ms/token at batch 1 (memory-bound). The \texttt{WeightStreamingModel} models the Cerebras WSE-3: at batch size 1, the full 360\,GB must stream from MemoryX at 1.2\,TB/s, yielding a per-token latency of $360 / 1{,}200 = 300$\,ms---$14{\times}$ \emph{slower} than the GPU cluster. However, the Cerebras architecture's advantage emerges at scale: at the optimal batch $B^{*} \approx 41{,}700$ (\Cref{eq:weightstream}), injection cost is fully amortized across tokens, achieving ${\sim}139{,}000$ tokens/s aggregate throughput versus ${\sim}47$ tokens/s per GPU (requiring a fleet of ${\sim}3{,}000$ H100s to match). The resolver flags that at $B^{*}$, the activation footprint approaches the 44\,GB SRAM ceiling. The \texttt{SensitivitySolver} confirms the qualitative regime change: the binding constraint shifts from $\text{BW}_{\text{HBM}}$ (GPU) to $\text{BW}_{\text{inject}}$ (WSE-3), illustrating that the optimal architecture depends critically on batch size.
|
|
|
|
\subsubsection{R2: Hardware Procurement Audit}
|
|
A researcher preparing a hardware procurement recommendation for LLaMA-70B inference needs to answer: \emph{should the next-generation cluster prioritize FLOPS or bandwidth?} The \texttt{SensitivitySolver} computes numerical partial derivatives of inference latency with respect to each hardware parameter:
|
|
|
|
\begin{lstlisting}[caption={\textbf{Sensitivity Analysis.} Identifying the binding constraint and its partial derivatives.},label={lst:sensitivity}]
|
|
import mlsysim
|
|
solver = mlsysim.SensitivitySolver()
|
|
res = solver.solve(model=mlsysim.Models.LLAMA_70B,
                   hardware=mlsysim.Hardware.A100)
|
|
print(res.sensitivities)
|
|
# {'peak_flops': -0.06, 'memory_bandwidth': -0.88,
|
|
# 'memory_capacity': 0.00}
|
|
print(f"Binding: {res.binding_constraint}")
|
|
# Output: memory_bandwidth
|
|
\end{lstlisting}
|
|
|
|
The result is unambiguous: $\partial T / \partial \text{BW} = -0.88$ while $\partial T / \partial \text{FLOPS} = -0.06$, meaning a 10\% increase in HBM bandwidth yields an 8.8\% latency reduction, whereas a 10\% increase in peak FLOPS yields only 0.6\%. The \texttt{SynthesisSolver} then performs the inverse solve: given a 50\,ms inter-token latency SLA, it synthesizes the minimum required bandwidth as $\text{BW}_{\text{req}} = |W| / T_{\text{target}} = 2{,}800$\,GB/s, $1.4{\times}$ the A100's 2\,TB/s, confirming that LLaMA-70B inference is firmly in the memory-bound regime and that hardware procurement should prioritize HBM bandwidth over peak throughput.
|
|
|
|
\subsubsection{R3: End-to-End LLaMA-70B Training Audit}
|
|
To illustrate how \mlsysim's solvers compose across all six taxonomy domains, we trace a complete training analysis for LLaMA-70B on 512 H100 GPUs (64 nodes $\times$ 8 GPUs, NVLink intra-node, 400\,Gb/s InfiniBand inter-node) in Qu\'ebec (17~gCO$_2$/kWh, PUE 1.1).
|
|
|
|
\textbf{Node (Walls 1--3).} With DP degree 64, each DP rank processes $4\text{M} / 64 = 62{,}500$ tokens per step. Within each rank, TP=8 partitions the model across 8 local GPUs. The per-rank compute demand is $C_{\text{rank}} = 6 \times 70\text{B} \times 62{,}500 = 2.63 \times 10^{16}$ FLOPs. The 8-GPU TP group delivers $989 \times 10^{12} \times 0.40 \times 8 = 3.17 \times 10^{15}$ FLOP/s (at $\eta = 0.40$), yielding a per-step compute time of $T_{\text{compute}} = 2.63 \times 10^{16} / 3.17 \times 10^{15} = 8.3$\,s. The \texttt{SingleNodeModel} classifies this as compute-bound: memory bandwidth ($8 \times 3.35 = 26.8$\,TB/s) can stream the 140\,GB model weights in 5.2\,ms, far below the 8.3\,s compute time.
|
|
|
|
\textbf{Data (Walls 8--10).} The \texttt{DataModel} checks ingestion: at global batch size 4M tokens, the data pipeline must sustain $4\text{M} \times 2\;\text{bytes} / 8.3\;\text{s} \approx 0.96$\,MB/s from storage---a trivially small demand. With NVMe delivering 6.5\,GB/s per node across 64 nodes, the pipeline is not the bottleneck. (For LLM training, tokenized data is compact; the data wall binds primarily in vision tasks with large image payloads.)
|
|
|
|
\textbf{Algorithm (Walls 11--13).} The \texttt{ScalingModel} verifies that the training budget is compute-optimal: for $C = 2 \times 10^{24}$ FLOPs, the Chinchilla-optimal model size is $P^{*} \approx 130$B, indicating that 70B at this budget is slightly over-trained (more data per parameter than optimal), a deliberate choice for inference efficiency.
|
|
|
|
\textbf{Fleet (Walls 14--16).} The \texttt{DistributedModel} models communication: with TP=8 (intra-node NVLink, 900\,GB/s) and DP=64 (inter-node IB, 50\,GB/s per port), the DP AllReduce synchronizes $140\;\text{GB} / 8 = 17.5$\,GB of gradients per TP rank across 64 nodes. Ring AllReduce cost is $2 \times (63/64) \times 17.5\;\text{GB} / 50\;\text{GB/s} \approx 689$\,ms. With $\eta_{\text{overlap}} = 0.85$, only $0.15 \times 689 = 103$\,ms of communication is exposed, yielding a scaling efficiency of $8.3 / (8.3 + 0.103) = 98.8\%$ at 64 nodes. The \texttt{ReliabilityModel} estimates cluster MTBF $= 10{,}000\;\text{hrs} / 512 \approx 19.5$\,hrs, requiring hourly checkpoints.
|
|
|
|
\textbf{Operations (Walls 17--20).} The \texttt{EconomicsModel} projects a 30-day training run: CapEx (512 H100s at \$30K each) of \$15.4M amortized over 3 years yields a per-run allocation of $\$15.4\text{M} \times 30/1{,}095 \approx \$422\text{K}$, plus OpEx (power at 700W $\times$ 512 GPUs $\times$ 720\,hrs at \$0.06/kWh) of \$15.5K, for a total run TCO of $\sim$\$0.44M. The \texttt{SustainabilityModel} estimates 285\,MWh energy and 5.3 tonnes CO$_2$---$40{\times}$ less carbon than the same run in Iowa (680~gCO$_2$/kWh).
|
|
|
|
\textbf{Analysis (Walls 21--22).} The \texttt{SensitivitySolver} confirms the binding constraint is compute (Wall~1), not memory or communication, with $\partial T / \partial F_{\text{peak}} = -0.91$. The \texttt{SynthesisSolver} synthesizes the minimum hardware to complete training in 14 days: 1{,}024 GPUs, doubling DP to 128.
|
|
|
|
This end-to-end trace exercises 12 of the 22 walls through a single model, demonstrating how resolver composition produces a holistic system assessment from individual physics-based constraint equations.
|
|
|
|
\subsubsection{R4: Automated Parallelism Search (Tier 3 Optimizer)}
|
|
A researcher needs to schedule a 175B-parameter model on a new 2{,}048-GPU cluster. Manually searching the 3D-parallelism space ($\text{TP} \times \text{PP} \times \text{DP}$) is error-prone: a split that maximizes DP might exceed the 80\,GB HBM capacity, while a split that maximizes TP might saturate the NVLink interconnect. Instead of trial and error, the researcher invokes a Tier 3 Optimizer. They configure the \texttt{ParallelismOptimizer} with the workload and cluster constraints, setting the objective to maximize MFU subject to $M_{\text{peak}} \le 72\,\text{GB}$ (leaving 10\% headroom). The optimizer performs a constrained grid search over all valid algebraic factorizations of 2{,}048, evaluating the \texttt{DistributedModel} at each point. In under 0.5 seconds, it returns the optimal schedule: $\text{TP}{=}8$, $\text{PP}{=}8$, $\text{DP}{=}32$, correctly deducing that TP must match the intra-node GPU count to avoid traversing the slower inter-node fabric, and that $\text{PP}{=}8$ is the minimum pipeline depth required to fit the remaining state in memory. This demonstrates the power of the ``engineering engine'' to invert the analytical models into automated design-space synthesis.
|
|
|
|
\section{Fallacies \& Pitfalls}
|
|
\label{sec:fallacies}
|
|
|
|
Following the tradition of \citet{hennessy2024architecture}, we highlight common misconceptions that \mlsysim is designed to expose.
|
|
|
|
\textbf{Fallacy: Doubling peak FLOP/s halves training time.}
|
|
A student might assume that upgrading from A100 (312\,TFLOP/s FP16) to H100 (989\,TFLOP/s, a $3.2{\times}$ increase) should yield a $3.2{\times}$ speedup. \mlsysim's \texttt{SingleNodeModel} reveals why this is false: LLM inference is memory-bandwidth-bound, and the A100-to-H100 bandwidth improvement is only $1.7{\times}$ (2.0~TB/s $\to$ 3.35~TB/s). For memory-bound workloads, training time scales with bandwidth, not FLOPS. The Roofline model (\Cref{eq:bottleneck}) makes this visible: the binding constraint determines which hardware parameter matters.
|
|
|
|
\textbf{Fallacy: Communication overhead is negligible at 512 GPUs.}
|
|
At 512 H100 GPUs (64 nodes), the DP AllReduce for a 70B model's 17.5\,GB gradient shard costs 689\,ms, of which only 103\,ms is exposed after 85\% compute--communication overlap---just 1.2\% of the 8.3\,s compute step. But the \texttt{ReliabilityModel} reveals the hidden cost: cluster MTBF drops to $\sim$20 hours, requiring hourly checkpoints that each pause training for 30--60 seconds. Over a 30-day run, checkpoint overhead (Wall~19) exceeds communication overhead (Wall~14) by $10{\times}$, a cost invisible to communication-only analysis.
|
|
|
|
\textbf{Pitfall: Using peak bandwidth in back-of-the-envelope calculations.}
|
|
Vendor datasheets report peak HBM bandwidth (e.g., 3.35\,TB/s for H100). In practice, sustained bandwidth under real workloads is 70--85\% of peak due to bank conflicts, address patterns, and memory controller scheduling~\citep{nvidia2023h100}. A student using peak bandwidth will underestimate LLM decode latency by 15--30\%. \mlsysim's MFU parameter ($\eta$) explicitly accounts for this gap, and the default ranges (\Cref{eq:efficiency}) guide students toward realistic estimates.
|
|
|
|
\textbf{Pitfall: Ignoring geography in carbon accounting.}
|
|
Two identical training runs produce vastly different environmental impact depending on grid carbon intensity. As demonstrated in Case~I2, the same 256-GPU cluster emits $40{\times}$ more CO$_2$ in Iowa than in Qu\'ebec. Students who omit Wall~18 (Sustainability) from their analysis miss a first-order systems design variable---one that increasingly affects both cost (carbon pricing) and regulatory compliance.
|
|
|
|
\textbf{Fallacy: Quantization always provides a linear speedup.}
|
|
Reducing precision from FP16 to INT4 ($r_{\text{quant}} = 4{\times}$) reduces memory by $4{\times}$, but the compute speedup depends on whether the workload is memory-bound or compute-bound. For memory-bound LLM decode, the $4{\times}$ memory reduction translates to nearly $4{\times}$ throughput improvement because the bottleneck is weight reads. For compute-bound training, the same quantization provides zero throughput benefit because compute---not memory---is the ceiling. \mlsysim's Roofline analysis makes this regime-dependent behavior explicit.
|
|
|
|
\section{Discussion \& Limitations}
|
|
\label{sec:discussion}
|
|
|
|
``All models are wrong, but some are useful''~\citep{box1976science}. Users must understand the boundaries of \mlsysim's analytical abstraction. We organize the limitations into modeling scope, accuracy trade-offs, and pedagogical implications, then outline future directions.
|
|
|
|
\subsection{What \mlsysim Cannot Model}
|
|
|
|
\textbf{No microarchitectural effects.} \mlsysim has no notion of L1/L2 cache hierarchies, branch prediction, warp scheduling, or register pressure. These second-order effects are absorbed into a single scalar efficiency parameter ($\eta$, the ratio of sustained to peak FLOP/s). While $\eta$ provides a serviceable approximation for back-of-the-envelope reasoning, it cannot capture workload-dependent microarchitectural behavior; a matrix multiply and a sparse attention kernel may achieve very different $\eta$ on identical silicon.
|
|
|
|
\textbf{No real network congestion.} The communication model uses the classical $\alpha$-$B_{\text{link}}$ formulation (latency plus inverse-bandwidth), which assumes dedicated links. \mlsysim does not model adaptive routing, network contention under multi-tenant traffic, or congestion collapse; these phenomena become critical at scales beyond $\sim$10{,}000 nodes, precisely the regime where ASTRA-sim~2.0~\citep{won2023astrasim2} provides essential fidelity.

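For reference, a minimal sketch of the point-to-point form of this cost model, with assumed link parameters:

\begin{lstlisting}[language=Python]
def transfer_time(n_bytes, alpha, b_link):
    # Classical latency + inverse-bandwidth model:
    # T = alpha + n / B_link, assuming a dedicated link.
    return alpha + n_bytes / b_link

# Example: a 17.5 GB gradient shard over an assumed 50 GB/s link
# with 2 us latency; bandwidth dominates at this message size.
print(transfer_time(17.5e9, 2e-6, 50e9))  # ~0.35 s
\end{lstlisting}
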
\textbf{No OS/runtime overhead.} Kernel launch latency, CUDA stream scheduling, Python GIL contention, and host--device transfer overhead are absent. For inference-dominated workloads where kernel launch time can rival compute time, this omission can meaningfully affect predictions.

\textbf{No heterogeneous fleets.} The \texttt{DistributedModel} assumes homogeneous nodes: all accelerators in a fleet share the same compute, memory, and interconnect specifications. Production clusters increasingly mix hardware generations (e.g., A100 and H100 nodes in the same job), and fleet-level efficiency metrics such as ML Productivity Goodput~\citep{wongpanich2025fleet} capture this heterogeneity. Modeling heterogeneous fleets would require per-node load balancing and straggler analysis beyond the current analytical framework.

\textbf{No dynamic behavior.} \mlsysim models steady-state throughput. Transient effects (thermal throttling, dynamic clock boosting, memory fragmentation over long training runs, and checkpoint I/O bursts) are outside its scope. A training run that degrades over 72 hours due to thermal saturation will appear identical to one that sustains peak throughput.

\textbf{Heuristic accuracy models.} The \texttt{CompressionModel}'s accuracy degradation curves are heuristic step functions (for quantization) and exponentials (for pruning), not architecture-specific empirical fits. Real accuracy loss depends on model architecture, calibration methodology, and quantization method (e.g., well-calibrated GPTQ INT4 can achieve $<$0.5\% degradation, while naive round-to-nearest may lose 10\%+). Users should treat these curves as directional indicators, not ground truth.

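The general shapes can be sketched as follows (the breakpoints and rate constants here are illustrative stand-ins, not the engine's calibrated values):

\begin{lstlisting}[language=Python]
import math

def quant_degradation(bits):
    # Heuristic step function: accuracy loss jumps at bit-width
    # thresholds (illustrative breakpoints and losses).
    if bits >= 8: return 0.001
    if bits >= 4: return 0.01
    if bits >= 2: return 0.08
    return 0.30

def prune_degradation(sparsity, k=3.0):
    # Heuristic exponential: loss accelerates as sparsity -> 1,
    # normalized to an assumed 10% loss at full sparsity.
    return 0.10 * math.expm1(k * sparsity) / math.expm1(k)
\end{lstlisting}
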
\subsection{Walls Not Included}

The 22-wall taxonomy is comprehensive but not exhaustive. Several constraints were considered and excluded:
\begin{itemize}
\item \textbf{Thermal throttling}: sustained power density can force throughput below peak TDP, but this is absorbed into $\eta$ rather than modeled as a distinct wall.
\item \textbf{Resource fragmentation}: scattered GPU availability across nodes prevents job scheduling even when aggregate capacity is sufficient; this is a combinatorial bin-packing problem beyond the current analytical framework.
\item \textbf{Compiler/graph optimization}: the gap between a framework's computational graph and the executed kernel schedule affects both latency and MFU, but varies too rapidly across software versions to model analytically.
\item \textbf{Dynamic network congestion}: multi-tenant traffic creates contention beyond the static bisection bandwidth model (Wall~10); cycle-level simulators like ASTRA-sim~\citep{won2023astrasim2} are better suited for this regime.
\end{itemize}
The selection criterion for inclusion was: does the constraint have a stable, published analytical formulation that remains valid across hardware generations? Constraints requiring empirical trace data or combinatorial optimization were deferred to future work.

\subsection{The Accuracy--Speed Trade-off}

Each omission above reflects the same trade-off: three orders of magnitude improvement in evaluation speed at the cost of second-order fidelity. The precedent is the MIPS/SPIM simulator~\citep{hennessy2024architecture}, which models pipeline hazards and stalls but omits superscalar execution and cache hierarchies---prioritizing pedagogical clarity over the full complexity of commercial processors. \mlsysim applies the same philosophy to ML systems, making quantitative reasoning accessible to students who may never operate a production cluster.

The relevant question is not ``How accurate is \mlsysim?'' but ``Does it identify the correct binding constraint?'' A first-order model that correctly determines whether a system is memory-bound, compute-bound, or network-bound provides actionable architectural insight even when its absolute latency prediction is $\pm$20\% from a cycle-accurate trace. The binding constraint dictates which hardware investment yields the largest return, and this ordinal ranking is far more robust than cardinal predictions. Practitioners who know that their system is memory-bandwidth-bound will invest in higher-bandwidth memory regardless of whether the predicted latency is 47\,ms or 53\,ms.

\subsection{Future Work}

We identify several directions for extending \mlsysim.

\textbf{Broader empirical validation.} \Cref{sec:validation} validates against seven anchors spanning five domains. Future work will extend this to additional hardware generations (B200, TPU~v6e), inference serving under load (continuous batching with realistic request distributions), and checkpoint overhead at scale, where published data points are becoming available from reproducibility studies.

\textbf{Community hardware registry.} The Silicon Zoo currently contains a curated set of hardware entries verified against manufacturer datasheets. We plan to open contributions from the community, with automated verification scripts that cross-check submitted specifications against known physical limits (e.g., memory bandwidth cannot exceed pin count $\times$ data rate).

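One such cross-check, sketched with assumed HBM3 parameters (the pin counts and per-pin rates are illustrative):

\begin{lstlisting}[language=Python]
def bandwidth_plausible(pins, gbits_per_pin, claimed_gb_s):
    # Physical ceiling: bandwidth <= pin count x per-pin data rate.
    ceiling_gb_s = pins * gbits_per_pin / 8
    return claimed_gb_s <= ceiling_gb_s

# Example: five 1024-pin HBM3 stacks at 6.4 Gb/s per pin give a
# 4096 GB/s ceiling, so a 3350 GB/s claim passes (illustrative).
print(bandwidth_plausible(5 * 1024, 6.4, 3350))  # True
\end{lstlisting}
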
\textbf{Custom degradation curves.} Future versions will allow users to supply empirically fitted accuracy--compression curves from their own quantization experiments, replacing the current heuristic curves with data-grounded models.

\textbf{TinyTorch integration.} \mlsysim provides analytical predictions; TinyTorch~\citep{tinytorch2025}, the companion educational framework, provides implementation-based verification. Connecting the two tools creates a predict-then-verify loop: students estimate training time and memory consumption in \mlsysim, then run the actual training in TinyTorch and compare. This closed loop reinforces quantitative reasoning by grounding analytical models in empirical observation.

\textbf{Expanding Tier 3 Optimizers (Pareto Frontiers).} Currently, the Tier 3 optimizers search single-dimensional objective spaces (e.g., maximizing MFU or maximizing batch size under a latency constraint). Future work will extend the Tier 3 engine to support multi-objective Pareto frontiers, simultaneously optimizing across latency, total cost, carbon footprint, and accuracy. This will enable richer design-space exploration and formally expose the inherent tensions between performance and sustainability.

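A minimal sketch of the dominance filter underlying such a search (pure Python; all objectives are minimized):

\begin{lstlisting}[language=Python]
def pareto_front(points):
    # Keep points not dominated by any other: q dominates p if
    # q <= p in every objective and the two points differ.
    return [p for p in points
            if not any(all(qi <= pi for qi, pi in zip(q, p))
                       and q != p for q in points)]

# (latency_ms, cost_usd, kg_co2) per candidate configuration
candidates = [(47, 1.00, 0.9), (53, 0.60, 0.4), (60, 1.10, 1.0)]
print(pareto_front(candidates))  # the third point is dominated
\end{lstlisting}
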
\subsection{The Pedagogical Argument}

Even when \mlsysim's predictions deviate by 20\% from measured values, the pedagogical value lies in the reasoning process rather than the numerical output. A student who sweeps 1{,}000 configurations and identifies memory bandwidth as the binding constraint for LLM inference has acquired a transferable analytical skill: determining which resource limits performance. The framework trains students to formulate the correct quantitative questions (arithmetic intensity, ridge point location, communication-to-computation ratio) and these questions generalize to production systems even as specific hardware parameters change across generations.

\section{Conclusion}
\label{sec:conclusion}

Machine learning has become infrastructure, yet the tools for reasoning about that infrastructure remain either too slow for interactive exploration or too narrow for full-stack analysis. \mlsysim addresses this gap by formalizing the physics of ML systems into a dimensionally strict, composable engine. Its 5-layer progressive lowering architecture cleanly separates computational demand from silicon supply, while 25 resolvers (20 models, 2 solvers, and 3 optimizers) spanning 22 systems walls codify every fundamental bottleneck---from single-accelerator compute ceilings to fleet-scale carbon accounting---into a unified analytical suite.

By evaluating complete system configurations in under one second, \mlsysim makes full-stack ML systems analysis feasible on commodity hardware without access to production clusters. The engine's integration with Marimo enables interactive, WebAssembly-powered web applications that allow students to directly manipulate hardware variables and observe constraint boundaries shift in real time. Because all components are deterministic and purely analytical, labs built on \mlsysim are fully autogradable and produce identical results across platforms, ensuring that students at resource-constrained institutions engage with the same exercises as those at well-funded research universities.

\mlsysim is part of a broader curriculum vision. Together with TinyTorch~\citep{tinytorch2025}, which teaches how ML frameworks work internally through progressive implementation, \mlsysim teaches how to reason about ML systems at scale through analytical modeling. The two tools provide complementary coverage: TinyTorch addresses the framework internals from the bottom up (tensors, autograd, optimizers), while \mlsysim addresses systems-level analysis from the top down (scaling laws, binding constraints, fleet economics). Both are available as part of the open-source \emph{Machine Learning Systems} textbook~\citep{mlsysbook2025} at \url{https://mlsysbook.ai}.

As ML systems grow in scale and complexity (trillion-parameter models, million-device fleets, multi-modal pipelines), the need for rapid analytical reasoning tools will increase correspondingly. \mlsysim provides a foundation for that reasoning: not a replacement for empirical measurement, but a complement that narrows the design space before committing hardware resources to validation.

\bibliography{references}

\end{document}