docs(mlsysim): align website consistency and expand math foundations

- Fix landing page latency to match accuracy.qmd (0.34→0.42 ms, 2941→2381 img/s)
- Add Three-Level Evaluation wall mapping table to architecture.qmd
- Align KV-cache formula: show both simplified and PagedAttention forms
- Add 6 missing wall equation sections to math.qmd (Walls 3, 5, 9, 10, 12, 19)
Vijay Janapa Reddi
2026-03-10 10:07:36 -04:00
parent d497efab46
commit 17c58fbc31
3 changed files with 142 additions and 4 deletions


@@ -186,4 +186,10 @@ The `Scenario.evaluate()` entry point orchestrates solver composition through a
A feasibility failure at Level 1 short-circuits the evaluation — there is no point optimizing AllReduce if the model does not fit in memory.
| Level | Purpose | Walls Evaluated |
|:------|:--------|:----------------|
| 1. Feasibility | Does it fit? | 12 (capacity), 5 (KV-cache), 8 (I/O) |
| 2. Performance | How fast? | 1–7 (node), 11–16 (algorithm + fleet) |
| 3. Macro | What does it cost? | 17–20 (operations), 21–22 (analysis) |
*See the [Solver Guide](solver-guide.qmd) to learn how to apply these solvers, and [Math Foundations](math.qmd) for the equations behind each.*


@@ -273,8 +273,8 @@ profile = Engine.solve(
)
print(f"Bottleneck: {profile.bottleneck}") # → Memory Bound
print(f"Latency: {profile.latency.to('ms'):~.2f}") # → 0.42 ms
print(f"Throughput: {profile.throughput:.0f} img/s") # → 2381 img/s
```
At batch=1, ResNet-50 loads ~50 MB of weights but performs only ~8 GFLOPs, making it firmly memory-bound on any modern GPU. The solver identifies this in microseconds using the **Iron Law** [@williams2009roofline]:
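The memory-bound classification above can be sketched directly from the roofline model. This is a minimal illustration, not the solver's implementation; the peak FLOP/s and bandwidth figures below are illustrative H100-class numbers, not values from `mlsysim`:

```python
def classify_bottleneck(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline classification: compare arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved   # FLOP per byte moved
    ridge = peak_flops / peak_bw      # intensity at which compute time == memory time
    return "Memory Bound" if intensity < ridge else "Compute Bound"

# ResNet-50 at batch=1: ~8 GFLOPs of work against ~50 MB of weights loaded.
# Assumed peaks (H100-class, fp16 dense): ~989 TFLOP/s, ~3.35 TB/s HBM.
print(classify_bottleneck(8e9, 50e6, 989e12, 3.35e12))  # → Memory Bound
```

With an intensity of 160 FLOP/byte against a ridge point near 295, the workload sits well below the ridge, so the classification is memory-bound regardless of batch-1 kernel details.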


@@ -169,12 +169,24 @@ requires the same memory load as a full matrix-vector product, but performs far
### 3.3 KV-Cache Size
**Simplified form** (contiguous allocation):
$$
\text{KV\_Bytes} = 2 \times L \times H_{\text{kv}} \times d \times S \times B \times b
$$
where $L$ is the number of layers, $H_{\text{kv}}$ is the number of KV heads (equals $H$ for MHA, less for GQA), $d$ is the head dimension, $S$ is the sequence length, $B$ is the batch size, and $b$ is the bytes per parameter.
**PagedAttention form** (paged allocation, as in vLLM):
$$
\text{KV\_Bytes} = 2 \times L \times H_{\text{kv}} \times d \times \lceil S/p \rceil \times p \times B \times b
$$
where $p$ is the page size in tokens. When $p = S$ (no paging), this reduces to the simplified form.
The factor of 2 counts both the K and V matrices. At fp16 (2 bytes/param), a 70B model with
a 4096-token context at batch=32 requires approximately **540 GB** of KV-cache — more than
a single H100 node can hold.
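Both forms follow directly from the definitions above. The sketch below uses a hypothetical GQA configuration (80 layers, 8 KV heads, head dimension 128), chosen only to show that the paged form reduces to the simplified form when $p = S$:

```python
import math

def kv_bytes_simple(L, H_kv, d, S, B, b):
    # Factor of 2 counts both the K and V matrices.
    return 2 * L * H_kv * d * S * B * b

def kv_bytes_paged(L, H_kv, d, S, B, b, p):
    # Round the sequence up to whole pages of p tokens each.
    return 2 * L * H_kv * d * math.ceil(S / p) * p * B * b

# Hypothetical GQA config: 80 layers, 8 KV heads, head dim 128, fp16.
args = dict(L=80, H_kv=8, d=128, S=4096, B=32, b=2)
assert kv_bytes_paged(**args, p=4096) == kv_bytes_simple(**args)  # p = S: no paging
```

When $p$ does not divide $S$, the paged form is slightly larger than the simplified form — that per-sequence rounding is the internal fragmentation PagedAttention bounds to at most one page per sequence.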
---
@@ -444,6 +456,126 @@ This enables hardware–software co-design: rather than asking "how fast is my s
---
## 16. Software Efficiency (Wall 3: MFU Decomposition)
*Implemented in [`mlsysim.core.solver.EfficiencySolver`](api/core.solver.EfficiencySolver.qmd).*
The gap between peak and achieved FLOP/s is captured by the utilization parameter $\eta$:
$$
\eta = \frac{\text{Achieved\_FLOP/s}}{\text{Peak\_FLOP/s}}
$$
In practice, $\eta$ decomposes into multiplicative factors:
$$
\eta = \eta_{\text{occupancy}} \times \eta_{\text{fusion}} \times \eta_{\text{precision}} \times \eta_{\text{memory}}
$$
where occupancy captures warp-level parallelism, fusion captures kernel launch elimination, precision captures tensor core utilization, and memory captures data reuse. Well-optimized training (Megatron-LM, DeepSpeed) achieves $\eta \approx 0.35\text{–}0.55$; unoptimized inference may drop below 0.10.
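A minimal sketch of the decomposition; the four factor values below are hypothetical, chosen only to land inside the quoted training band:

```python
def mfu(occupancy, fusion, precision, memory):
    """Multiplicative decomposition of eta = achieved / peak FLOP-rate."""
    return occupancy * fusion * precision * memory

# Hypothetical factors for a well-tuned training run.
eta = mfu(occupancy=0.85, fusion=0.9, precision=0.8, memory=0.7)
print(f"eta = {eta:.3f}")  # ≈ 0.43, inside the 0.35–0.55 band quoted above
```

The multiplicative structure is the point: each factor is individually modest, yet their product explains why even well-optimized systems sit far below peak.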
---
## 17. Continuous Batching (Wall 5: PagedAttention)
*Implemented in [`mlsysim.core.solver.ContinuousBatchingSolver`](api/core.solver.ContinuousBatchingSolver.qmd).*
Static batching wastes memory by pre-allocating contiguous KV-cache for the maximum sequence length. PagedAttention allocates KV-cache in fixed-size pages:
$$
\text{KV\_paged} = 2 \times L \times H_{\text{kv}} \times d \times \lceil S/p \rceil \times p \times B \times b
$$
The memory savings come from eliminating internal fragmentation. The effective batch size $B_{\text{eff}}$ under a memory budget $M_{\text{budget}}$ is:
$$
B_{\text{eff}} = \left\lfloor \frac{M_{\text{budget}} - |W|}{2 \times L \times H_{\text{kv}} \times d \times S \times b} \right\rfloor
$$
where $|W|$ is the model weight footprint. Continuous batching dynamically inserts and removes requests as they complete, keeping $B_{\text{eff}}$ near maximum at all times.
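The effective-batch bound can be sketched as a one-liner. The 80 GB budget, 14 GB weight footprint, and 7B-class GQA shape below are illustrative assumptions, not solver defaults:

```python
def effective_batch(mem_budget, weight_bytes, L, H_kv, d, S, b):
    """Max concurrent sequences whose KV-cache fits in memory left after weights."""
    per_seq_kv = 2 * L * H_kv * d * S * b   # simplified (contiguous) KV per sequence
    return (mem_budget - weight_bytes) // per_seq_kv

# Hypothetical: 80 GB device, 14 GB fp16 weights, 7B-class GQA model.
B_eff = effective_batch(80e9, 14e9, L=32, H_kv=8, d=128, S=4096, b=2)
print(f"B_eff = {B_eff:.0f}")  # → 122
```

Each sequence here costs ~0.54 GB of KV-cache, so the remaining 66 GB admits 122 concurrent sequences; continuous batching keeps the running count near that ceiling.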
---
## 18. Data Transformation (Wall 9: CPU Preprocessing)
*Implemented in [`mlsysim.core.solver.TransformationSolver`](api/core.solver.TransformationSolver.qmd).*
CPU preprocessing (decode, resize, tokenize, augment) must keep pace with GPU consumption:
$$
T_{\text{transform}} = \frac{B \times S}{C_{\text{throughput}} \times W}
$$
where $B$ is the batch size, $S$ is the per-sample processing cost (e.g., JPEG decode + resize), $C_{\text{throughput}}$ is the single-core throughput, and $W$ is the number of CPU workers. The pipeline is stalled when:
$$
T_{\text{transform}} > T_{\text{compute}}
$$
Common mitigations: increase $W$ (more workers), cache preprocessed data, or use GPU-accelerated decoding (DALI).
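The stall condition can be checked directly. The batch size, per-core throughput, worker counts, and GPU step time below are hypothetical, chosen so that doubling $W$ flips the outcome:

```python
def is_input_stalled(B, S_cost, core_throughput, workers, t_compute):
    """Pipeline stalls when CPU preprocessing outlasts the GPU compute step."""
    t_transform = (B * S_cost) / (core_throughput * workers)
    return t_transform > t_compute

# Hypothetical: batch 256, 1 work-unit of decode+resize per sample,
# 100 work-units/s per core, GPU step of 0.2 s.
assert is_input_stalled(256, 1.0, 100.0, 8, 0.2)        # 0.32 s of CPU work: stalled
assert not is_input_stalled(256, 1.0, 100.0, 16, 0.2)   # doubling workers clears it
```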
---
## 19. Network Topology (Wall 10: Bisection Bandwidth)
*Implemented in [`mlsysim.core.solver.TopologySolver`](api/core.solver.TopologySolver.qmd).*
The effective bandwidth available for collective communication depends on the network topology and its oversubscription ratio:
$$
\text{BW}_{\text{eff}} = \frac{\text{BW}_{\text{link}} \times \beta}{\text{oversubscription}}
$$
where $\beta$ is the bisection bandwidth ratio (1.0 for full bisection in fat-tree topologies, <1.0 for oversubscribed networks). For a $k$-ary fat-tree with oversubscription ratio $r$:
$$
\text{BW}_{\text{bisection}} = \frac{k^2}{4r} \times \text{BW}_{\text{link}}
$$
Fat-tree topologies (Leiserson, 1985) provide full bisection bandwidth ($r=1$), making them ideal for AllReduce-heavy workloads. Dragonfly and torus topologies trade bisection bandwidth for lower cost at scale.
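The two bandwidth formulas above, sketched with illustrative parameters (the 400 Gb/s link speed, $k = 16$, and oversubscription ratios are assumptions, not solver values):

```python
def effective_bw(link_bw, beta, oversub):
    """Effective collective bandwidth under a bisection ratio and oversubscription."""
    return link_bw * beta / oversub

def fat_tree_bisection(k, r, link_bw):
    """Bisection bandwidth of a k-ary fat-tree per the formula above."""
    return (k ** 2) / (4 * r) * link_bw

# Hypothetical: 400 Gb/s links, 16-ary fat-tree.
print(fat_tree_bisection(16, 1, 400))   # full bisection (r = 1)
print(effective_bw(400, 1.0, 4))        # 4:1 oversubscription quarters effective BW
```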
---
## 20. Inference-Time Scaling (Wall 12: Reasoning Compute)
*Implemented in [`mlsysim.core.solver.InferenceScalingSolver`](api/core.solver.InferenceScalingSolver.qmd).*
Chain-of-thought, tree search, and other inference-time scaling strategies multiply the compute cost by the number of reasoning steps $K$:
$$
T_{\text{reasoning}} = K \times T_{\text{step}}
$$
where $T_{\text{step}}$ is the latency of a single forward pass (from the roofline model). For best-of-$N$ sampling:
$$
T_{\text{best-of-N}} = N \times T_{\text{generate}} + T_{\text{verify}}
$$
The key insight from Snell et al. (2024) is that inference-time compute can substitute for training-time compute: a smaller model with $K$ reasoning steps may match a larger model with a single pass, at the cost of $K\times$ inference latency.
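Both cost models are simple products; the per-step latencies and $N$ below are hypothetical:

```python
def reasoning_latency(k, t_step):
    """K sequential reasoning steps at one forward-pass latency each."""
    return k * t_step

def best_of_n_latency(n, t_generate, t_verify):
    """Best-of-N: N independent generations plus one verification pass."""
    return n * t_generate + t_verify

# Hypothetical: 0.5 s per generation, 0.1 s verifier, N = 8.
t = best_of_n_latency(8, 0.5, 0.1)
print(f"best-of-8 latency: {t:.1f} s")
```

Note the structural difference: best-of-$N$ generations are independent and can run in parallel across replicas, while chain-of-thought steps are sequential and cannot.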
---
## 21. Checkpoint I/O (Wall 19: MFU Penalty)
*Implemented in [`mlsysim.core.solver.CheckpointSolver`](api/core.solver.CheckpointSolver.qmd).*
Periodic checkpointing imposes an I/O burst that pauses training:
$$
\text{MFU\_penalty} = \frac{T_{\text{write}}}{T_{\text{interval}}}
$$
where $T_{\text{write}}$ is the time to write one checkpoint (model weights + optimizer state) and $T_{\text{interval}}$ is the checkpoint interval. For Adam optimizer with fp16 weights:
$$
\text{Checkpoint\_Size} = |W| \times (1 + 2 + 2 + 2) = 7|W|
$$
The factor of 7 accounts for fp16 weights (1×), fp32 master weights (2×), fp32 momentum (2×), and fp32 variance (2×), each measured relative to the fp16 weight footprint $|W|$. The optimal checkpoint interval from the Young-Daly formula (Section 10) balances checkpoint overhead against expected rework from failures.
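A sketch of the penalty calculation, assuming a 7× state multiplier for mixed-precision Adam (fp16 weights plus fp32 master weights, momentum, and variance) and hypothetical storage numbers:

```python
def mfu_penalty(t_write, t_interval):
    """Fraction of each checkpoint interval lost to the write stall."""
    return t_write / t_interval

def checkpoint_write_time(weight_bytes, state_multiplier, write_bw):
    # state_multiplier scales fp16 weight bytes up to the full checkpoint
    # (assumed 7x here: fp16 weights + fp32 master, momentum, variance).
    return weight_bytes * state_multiplier / write_bw

# Hypothetical: 14 GB fp16 weights, 25 GB/s filesystem, 30-minute interval.
t_write = checkpoint_write_time(14e9, 7, 25e9)   # ≈ 3.9 s per checkpoint
print(f"MFU penalty: {mfu_penalty(t_write, 1800):.4%}")
```

At these numbers the penalty is well under 1%, which is why the Young-Daly trade-off usually bites only at fleet scale, where failures force much shorter intervals.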
---
::: {.callout-note}
## Limitations of First-Order Models
These equations are first-order analytical models. They assume: