# mlsysim Cheat Sheet

Single-page reference for the ISCA tutorial.

---

## The Iron Law of ML Training

```
Time = FLOPs / (N x Peak_FLOPS x MFU x eta_scaling x Goodput)
```

| Symbol | Meaning | Typical Range |
|--------|---------|---------------|
| FLOPs | Total operations for the workload | 6PD for training (Chinchilla) |
| N | Number of accelerators | 1 to 100,000+ |
| Peak_FLOPS | Hardware peak (per device) | 989 TFLOPS (H100 FP16) |
| MFU | Model FLOPs Utilization | 0.30 - 0.55 |
| eta_scaling | Scaling efficiency (communication overhead) | 0.70 - 0.95 |
| Goodput | Fraction of time doing useful work (1 - failures - checkpoints) | 0.85 - 0.98 |
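
Worked example (plain Python; the workload size and efficiency numbers are assumptions for illustration, not mlsysim registry output):

```python
# Iron Law estimate with illustrative values.
P = 70e9                 # parameters (assumed)
D = 20 * P               # Chinchilla-optimal tokens (~20 per parameter)
flops = 6 * P * D        # total training FLOPs = 6PD

N = 256                  # accelerators (assumed)
peak = 989e12            # H100 FP16 peak FLOPS per device
mfu = 0.40               # Model FLOPs Utilization
eta_scaling = 0.85       # scaling efficiency
goodput = 0.95           # fraction of wall-clock doing useful work

time_s = flops / (N * peak * mfu * eta_scaling * goodput)
print(f"Estimated training time: {time_s / 86400:.0f} days")  # ~83 days
```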

---

## The 5 Key Equations

### 1. Roofline Bottleneck

```
T = max(FLOPs / Peak_effective, Bytes / BW_effective)
```

If compute time > memory time, you are **compute-bound** (increase FLOPS).
If memory time > compute time, you are **memory-bound** (increase bandwidth).
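
A minimal numeric sketch, assuming H100-class effective rates and an illustrative kernel:

```python
# Roofline check: which term dominates? (assumed workload, H100-class specs)
peak_eff = 989e12 * 0.5        # Peak_effective: peak FP16 x assumed efficiency
bw_eff = 3.35e12 * 0.8         # BW_effective: peak HBM BW x assumed efficiency

flops = 2e12                   # FLOPs the kernel executes (assumed)
bytes_moved = 16e9             # bytes moved to/from HBM (assumed)

t_compute = flops / peak_eff         # ~4.0 ms
t_memory = bytes_moved / bw_eff      # ~6.0 ms
print("memory-bound" if t_memory > t_compute else "compute-bound",
      f"T = {max(t_compute, t_memory) * 1e3:.1f} ms")
```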

### 2. KV-Cache Memory (PagedAttention)

```
KV_bytes = 2 x L x H_kv x D_head x S x B x bytes_per_param
```

L = layers, H_kv = KV heads, D_head = head dimension, S = sequence length, B = batch size.
Factor of 2 accounts for both Key and Value tensors.
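
A quick check with assumed Llama-3-8B-style attention dimensions (32 layers, 8 KV heads, head dim 128):

```python
# KV-cache size for assumed Llama-3-8B-like dimensions at fp16.
L, H_kv, D_head = 32, 8, 128     # layers, KV heads, head dim (assumed)
S, B = 4096, 32                  # sequence length, batch size
bytes_per_param = 2              # fp16

kv_bytes = 2 * L * H_kv * D_head * S * B * bytes_per_param
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")  # 16 GiB -- a fifth of an 80 GiB H100
```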

### 3. Ring AllReduce Communication

```
T_allreduce = 2(N-1)/N x M/BW + 2(N-1) x alpha
```

N = workers, M = message bytes, BW = link bandwidth, alpha = per-message latency.
As N grows large, the volume term approaches 2M/BW (bandwidth-optimal).
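
A small sketch with an assumed gradient size, link bandwidth, and per-message latency makes the asymptote visible:

```python
# Ring AllReduce time under assumed message size, bandwidth, and latency.
def allreduce_time(n, msg_bytes, bw, alpha):
    volume = 2 * (n - 1) / n * msg_bytes / bw   # bandwidth term -> 2M/BW as n grows
    latency = 2 * (n - 1) * alpha               # per-message latency term
    return volume + latency

M = 16e9           # 8B fp16 gradients ~ 16 GB (assumed)
BW = 50e9          # 400 Gb/s link ~ 50 GB/s (assumed)
alpha = 5e-6       # 5 us per message (assumed)

for n in (8, 64, 512):
    print(f"N={n:3d}: {allreduce_time(n, M, BW, alpha) * 1e3:.0f} ms")  # 560, 631, 644 ms
```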

### 4. Chinchilla Scaling Law

```
C = 6PD (compute-optimal training cost)
P* = sqrt(C/120) (optimal parameter count for budget C)
```

P = parameters, D = tokens, C = total FLOPs. Training is compute-optimal when
D ~ 20P (20 tokens per parameter).
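
Worked example for an assumed budget of 1e24 FLOPs:

```python
import math

# Compute-optimal split for an assumed budget C = 1e24 FLOPs.
C = 1e24
P_opt = math.sqrt(C / 120)   # C = 6PD with D = 20P  =>  C = 120 P^2
D_opt = 20 * P_opt

print(f"P* ~ {P_opt / 1e9:.0f}B parameters, D ~ {D_opt / 1e12:.1f}T tokens")
# -> ~91B parameters trained on ~1.8T tokens; 6 * P_opt * D_opt recovers C.
```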

### 5. Carbon Footprint

```
CO2_kg = Energy_kWh x PUE x Carbon_Intensity_gCO2_per_kWh / 1000
```

PUE = Power Usage Effectiveness (1.0 = perfect, 1.1 = typical hyperscale).
Carbon intensity varies by more than 100x across regions (~1 gCO2/kWh hydro vs. ~800 gCO2/kWh coal).
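
Worked example, assuming a 500 MWh training run:

```python
# Carbon footprint of an assumed 500 MWh (500,000 kWh) run in two regions.
energy_kwh = 500_000
pue = 1.1   # typical hyperscale

for region, gco2_per_kwh in [("hydro-heavy grid", 1), ("coal-heavy grid", 800)]:
    co2_kg = energy_kwh * pue * gco2_per_kwh / 1000
    print(f"{region}: {co2_kg / 1000:.1f} tCO2")   # ~0.6 t vs. 440.0 t
```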

---

## Efficiency Parameter Guide

| Parameter | Description | Low | Typical | High |
|-----------|-------------|-----|---------|------|
| `efficiency` | MFU (fraction of peak FLOPS achieved) | 0.10 | 0.30-0.50 | 0.65 |
| `mfu` | Same as efficiency, used in fleet solvers | 0.10 | 0.35 | 0.55 |
| Batch=1 inference | LLM decode (memory-bound) | 0.01 | 0.05 | 0.15 |
| Batched inference | LLM prefill / CNN inference | 0.20 | 0.40 | 0.60 |
| Training (single node) | Typical training loop | 0.20 | 0.40 | 0.55 |
| Training (distributed) | Large cluster with comms | 0.15 | 0.30 | 0.45 |
| FlashAttention | Fused attention kernel | 0.50 | 0.60 | 0.70 |
| TinyML (MCU) | Microcontroller inference | 0.05 | 0.15 | 0.30 |

---

## Quick API Reference

### 1. Single-Node Roofline

```python
from mlsysim import Engine, Hardware, Models

profile = Engine.solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    batch_size=1,
    precision="fp16",      # "fp32", "fp16", "int8", "int4", "fp8"
    efficiency=0.5,
    is_training=False,     # True for training memory/FLOPs
)
# Returns: PerformanceProfile with .latency, .throughput, .bottleneck,
# .memory_footprint, .mfu, .energy, .feasible
```

### 2. LLM Serving (Prefill + Decode)

```python
from mlsysim import ServingModel, Hardware, Models

result = ServingModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    seq_len=4096,
    batch_size=32,
    precision="fp16",
)
# Returns: ServingResult with .ttft, .itl, .kv_cache_size,
# .total_memory_required, .feasible
```

### 3. Distributed Training (3D Parallelism)

```python
from mlsysim import DistributedModel, Models, Systems

result = DistributedModel().solve(
    model=Models.Llama3_70B,
    fleet=Systems.Clusters.Research_256,
    batch_size=1024,
    tp_size=8, pp_size=4,
    precision="fp16",
    efficiency=0.4,
    overlap_comm=True,
)
# Returns: DistributedResult with .scaling_efficiency,
# .step_latency_total, .dp_communication_latency,
# .bubble_fraction, .effective_throughput, .parallelism
```

### 4. Compression (Quantization / Pruning)

```python
from mlsysim import CompressionModel, Hardware, Models

result = CompressionModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    method="quantization",   # "quantization", "pruning", "distillation"
    target_bitwidth=4,       # 4, 8, 16
)
# Returns: CompressionResult with .compression_ratio, .compressed_size_gb,
# .memory_savings_pct, .inference_speedup, .estimated_accuracy_delta
```

### 5. Sustainability and Economics

```python
from mlsysim import SustainabilityModel, EconomicsModel, Systems, Infra

fleet = Systems.Clusters.Research_256

co2 = SustainabilityModel().solve(fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4)
# Returns: SustainabilityResult with .total_energy_kwh, .carbon_footprint_kg,
# .water_usage_liters, .pue

tco = EconomicsModel().solve(fleet, duration_days=365, mfu=0.4)
# Returns: EconomicsResult with .tco_usd, .capex_usd, .total_opex_usd,
# .opex_energy_usd, .carbon_footprint_kg
```

---

## The 22 Walls at a Glance

| # | Wall | One-Liner |
|---|------|-----------|
| 1 | Compute | Peak FLOPS ceiling of a single accelerator |
| 2 | Memory | HBM capacity and bandwidth ceilings |
| 3 | Software | Gap between peak and achieved FLOPS (MFU) |
| 4 | Serving | LLM inference: compute-bound prefill vs. memory-bound decode |
| 5 | Batching | Static batching wastes memory through KV-cache fragmentation |
| 6 | Streaming | Wafer-scale shifts bottleneck from HBM to injection interconnect |
| 7 | Tail Latency | P99 latency grows non-linearly as utilization approaches 1.0 |
| 8 | Ingestion | Storage I/O must supply data at the rate the accelerator consumes it |
| 9 | Transformation | CPU preprocessing cannot keep pace with accelerator throughput |
| 10 | Locality | Network topology limits bisection bandwidth between nodes |
| 11 | Complexity | Chinchilla scaling laws govern compute-optimal training |
| 12 | Reasoning | Inference-time compute scales with reasoning chain length |
| 13 | Fidelity | Compression trades model fidelity for efficiency |
| 14 | Communication | Distributed training requires synchronization across N nodes |
| 15 | Fragility | Component failures are inevitable at scale (MTBF/N) |
| 16 | Multi-tenant | Shared clusters introduce queueing delays |
| 17 | Capital | Total cost of ownership bounds what is economically feasible |
| 18 | Sustainability | Energy consumption converts to carbon and water footprint |
| 19 | Checkpoint | Periodic state saves impose I/O burst penalties on training MFU |
| 20 | Safety | Privacy and fairness guarantees impose computational overhead |
| 21 | Sensitivity | Identifies the binding constraint via numerical partial derivatives |
| 22 | Synthesis | Inverse Roofline: derive hardware specs from an SLA target |

---

## Hardware Quick Reference

| Accelerator | Peak FP16 (TFLOPS) | HBM (GiB) | BW (TB/s) | TDP (W) |
|-------------|-------------------|------------|-----------|---------|
| V100 | 125 | 32 | 0.9 | 300 |
| A100 | 312 | 80 | 2.0 | 400 |
| H100 | 989 | 80 | 3.35 | 700 |
| H200 | 989 | 141* | 4.8 | 700 |
| B200 | 2,250 | 192 | 8.0 | 1,000 |
| nRF52840 | 0.000064 | 0.001 (1 MB flash) | 0.000064 | 0.015 |

*H200 capacity listed as 141 GB in registry (non-binary).

Access via: `Hardware.A100`, `Hardware.H100`, `Hardware.Tiny.nRF52840`, etc.

---

## Model Quick Reference

| Model | Parameters | Architecture | Access |
|-------|-----------|--------------|--------|
| ResNet-50 | 25.6M | CNN | `Models.ResNet50` |
| Llama-3-8B | 8.03B | Transformer | `Models.Llama3_8B` |
| Llama-3-70B | 70.6B | Transformer | `Models.Llama3_70B` |
| GPT-3 | 175B | Transformer | `Models.GPT3` |
| DS-CNN (KWS) | 26K | CNN | `Models.Tiny.DS_CNN` |
| MobileNetV2 | 3.4M | CNN | `Models.MobileNetV2` |
|