# mlsysim Cheat Sheet

Single-page reference for the ISCA tutorial.
## The Iron Law of ML Training

```
Time = FLOPs / (N x Peak_FLOPS x MFU x eta_scaling x Goodput)
```
| Symbol | Meaning | Typical Range |
|---|---|---|
| FLOPs | Total operations for the workload | 6PD for training (Chinchilla) |
| N | Number of accelerators | 1 to 100,000+ |
| Peak_FLOPS | Hardware peak (per device) | 312 TFLOPS (A100 FP16) - 989 TFLOPS (H100 FP16) |
| MFU | Model FLOPs Utilization | 0.30 - 0.55 |
| eta_scaling | Scaling efficiency (communication overhead) | 0.70 - 0.95 |
| Goodput | Fraction of time doing useful work (1 - failures - checkpoints) | 0.85 - 0.98 |
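
The Iron Law alone supports a back-of-the-envelope estimate before reaching for the simulator. A minimal plain-Python sketch; the workload size and cluster values are illustrative assumptions, not tutorial answers:

```python
# Iron Law sketch: Llama-3-70B-scale pretraining on 1,024 H100s (assumed).
P = 70e9                    # parameters (assumed workload)
D = 20 * P                  # Chinchilla-optimal tokens (~20 per parameter)
flops = 6 * P * D           # 6PD training FLOPs

N, peak = 1024, 989e12      # accelerators, H100 FP16 peak FLOPS
mfu, eta, goodput = 0.40, 0.85, 0.95   # typical-range values from the table

seconds = flops / (N * peak * mfu * eta * goodput)
print(f"{seconds / 86400:.0f} days")   # ~21 days at these settings
```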
## The 5 Key Equations

### 1. Roofline Bottleneck

```
T = max(FLOPs / Peak_effective, Bytes / BW_effective)
```

If compute time > memory time, you are compute-bound (increase FLOPS). If memory time > compute time, you are memory-bound (increase bandwidth).
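
A minimal sketch of the decision in plain Python, using H100-class numbers from the hardware table below; the batch-1 decode workload and 50% efficiency are illustrative assumptions:

```python
# Roofline sketch: which side of the max() binds?
def roofline_time(flops, bytes_moved, peak_flops, bw):
    t_compute = flops / peak_flops   # time if compute-bound
    t_memory = bytes_moved / bw      # time if memory-bound
    return max(t_compute, t_memory), ("compute" if t_compute > t_memory else "memory")

# Batch-1 LLM decode: ~2 FLOPs per parameter, all fp16 weights re-read per token.
P = 8e9
t, bound = roofline_time(2 * P, 2 * P, 989e12 * 0.5, 3.35e12)
print(f"{t*1e3:.2f} ms/token, {bound}-bound")  # ~4.78 ms/token, memory-bound
```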
### 2. KV-Cache Memory (PagedAttention)

```
KV_bytes = 2 x L x H_kv x D_head x S x B x bytes_per_param
```

L = layers, H_kv = KV heads, D_head = head dimension, S = sequence length, B = batch size. The factor of 2 accounts for both Key and Value tensors.
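
Plugging in published Llama-3-8B architecture constants (32 layers, 8 KV heads via GQA, head dimension 128), with an assumed batch and sequence length:

```python
# KV-cache sketch for Llama-3-8B at fp16.
L, H_kv, D_head = 32, 8, 128        # Llama-3-8B architecture constants
S, B = 4096, 32                     # assumed sequence length and batch size
bytes_per_param = 2                 # fp16

kv_bytes = 2 * L * H_kv * D_head * S * B * bytes_per_param
print(f"{kv_bytes / 2**30:.0f} GiB")  # 16 GiB -- 20% of an H100's 80 GiB
```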
### 3. Ring AllReduce Communication

```
T_allreduce = 2(N-1)/N x M/BW + 2(N-1) x alpha
```

N = workers, M = message bytes, BW = link bandwidth, alpha = per-message latency. As N grows large, the volume term approaches 2M/BW (bandwidth-optimal).
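
A quick numeric check under assumed NVLink-class link parameters; the message is one fp16 gradient copy of an assumed 8B-parameter model:

```python
# Ring AllReduce sketch: one gradient sync for 8B fp16 parameters.
N = 64                 # workers
M = 8e9 * 2            # message bytes (fp16 gradients, assumed model size)
BW = 450e9             # link bandwidth, bytes/s (assumed)
alpha = 5e-6           # per-message latency, seconds (assumed)

t = 2 * (N - 1) / N * (M / BW) + 2 * (N - 1) * alpha
print(f"{t*1e3:.1f} ms")   # ~70.6 ms; volume term is ~98% of the 2M/BW limit
```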
### 4. Chinchilla Scaling Law

```
C = 6PD           (compute-optimal training cost)
P* = sqrt(C/120)  (optimal parameter count for budget C)
```

P = parameters, D = tokens, C = total FLOPs. Training is compute-optimal when D ~ 20P (20 tokens per parameter); substituting D = 20P into C = 6PD gives C = 120P^2, hence P* = sqrt(C/120).
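
Pure arithmetic, so it is easy to verify by hand; the 1e24-FLOP budget is an assumed example:

```python
import math

# Chinchilla sketch: compute-optimal split of an assumed 1e24-FLOP budget.
C = 1e24
P_star = math.sqrt(C / 120)          # optimal parameter count
D_star = 20 * P_star                 # optimal token count
print(f"P* = {P_star/1e9:.0f}B params, D* = {D_star/1e12:.1f}T tokens")
# -> P* = 91B, D* = 1.8T; check: 6 * P_star * D_star == C (up to rounding)
```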
### 5. Carbon Footprint

```
CO2_kg = Energy_kWh x PUE x Carbon_Intensity_gCO2/kWh / 1000
```

PUE = Power Usage Effectiveness (1.0 = perfect, 1.1 = typical hyperscale). Carbon intensity varies by orders of magnitude across regions (~1 gCO2/kWh for hydro vs. ~800 gCO2/kWh for coal).
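
The same arithmetic by hand, with an assumed fleet and the endpoint intensities above:

```python
# Carbon sketch: 1,024 H100s (700 W TDP) at full power for 21 days (assumed).
energy_kwh = 1024 * 700 / 1000 * 21 * 24    # ~361 MWh of IT energy
pue = 1.1                                   # typical hyperscale

for region, ci in [("hydro", 1), ("coal", 800)]:   # gCO2/kWh endpoints
    co2_kg = energy_kwh * pue * ci / 1000
    print(f"{region}: {co2_kg/1000:.1f} tCO2")
# hydro: 0.4 tCO2, coal: 317.9 tCO2 -- siting dominates the footprint.
```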
## Efficiency Parameter Guide

| Parameter / Workload | Description | Low | Typical | High |
|---|---|---|---|---|
| `efficiency` | MFU (fraction of peak FLOPS achieved) | 0.10 | 0.30-0.50 | 0.65 |
| `mfu` | Same as `efficiency`, used in fleet solvers | 0.10 | 0.35 | 0.55 |
| Batch=1 inference | LLM decode (memory-bound) | 0.01 | 0.05 | 0.15 |
| Batched inference | LLM prefill / CNN inference | 0.20 | 0.40 | 0.60 |
| Training (single node) | Typical training loop | 0.20 | 0.40 | 0.55 |
| Training (distributed) | Large cluster with comms | 0.15 | 0.30 | 0.45 |
| FlashAttention | Fused attention kernel | 0.50 | 0.60 | 0.70 |
| TinyML (MCU) | Microcontroller inference | 0.05 | 0.15 | 0.30 |
## Quick API Reference

### 1. Single-Node Roofline

```python
from mlsysim import Engine, Hardware, Models

profile = Engine.solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    batch_size=1,
    precision="fp16",    # "fp32", "fp16", "int8", "int4", "fp8"
    efficiency=0.5,
    is_training=False,   # True for training memory/FLOPs
)
# Returns: PerformanceProfile with .latency, .throughput, .bottleneck,
# .memory_footprint, .mfu, .energy, .feasible
```
### 2. LLM Serving (Prefill + Decode)

```python
from mlsysim import ServingModel, Hardware, Models

result = ServingModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    seq_len=4096,
    batch_size=32,
    precision="fp16",
)
# Returns: ServingResult with .ttft, .itl, .kv_cache_size,
# .total_memory_required, .feasible
```
### 3. Distributed Training (3D Parallelism)

```python
from mlsysim import DistributedModel, Models, Systems

result = DistributedModel().solve(
    model=Models.Llama3_70B,
    fleet=Systems.Clusters.Research_256,
    batch_size=1024,
    tp_size=8, pp_size=4,
    precision="fp16",
    efficiency=0.4,
    overlap_comm=True,
)
# Returns: DistributedResult with .scaling_efficiency,
# .step_latency_total, .dp_communication_latency,
# .bubble_fraction, .effective_throughput, .parallelism
```
### 4. Compression (Quantization / Pruning)

```python
from mlsysim import CompressionModel, Hardware, Models

result = CompressionModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    method="quantization",   # "quantization", "pruning", "distillation"
    target_bitwidth=4,       # 4, 8, 16
)
# Returns: CompressionResult with .compression_ratio, .compressed_size_gb,
# .memory_savings_pct, .inference_speedup, .estimated_accuracy_delta
```
### 5. Sustainability and Economics

```python
from mlsysim import SustainabilityModel, EconomicsModel, Systems, Infra

fleet = Systems.Clusters.Research_256

co2 = SustainabilityModel().solve(
    fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4,
)
# Returns: SustainabilityResult with .total_energy_kwh, .carbon_footprint_kg,
# .water_usage_liters, .pue

tco = EconomicsModel().solve(fleet, duration_days=365, mfu=0.4)
# Returns: EconomicsResult with .tco_usd, .capex_usd, .total_opex_usd,
# .opex_energy_usd, .carbon_footprint_kg
```
## The 22 Walls at a Glance
| # | Wall | One-Liner |
|---|---|---|
| 1 | Compute | Peak FLOPS ceiling of a single accelerator |
| 2 | Memory | HBM capacity and bandwidth ceilings |
| 3 | Software | Gap between peak and achieved FLOPS (MFU) |
| 4 | Serving | LLM inference: compute-bound prefill vs. memory-bound decode |
| 5 | Batching | Static batching wastes memory through KV-cache fragmentation |
| 6 | Streaming | Wafer-scale shifts bottleneck from HBM to injection interconnect |
| 7 | Tail Latency | P99 latency grows non-linearly as utilization approaches 1.0 |
| 8 | Ingestion | Storage I/O must supply data at the rate the accelerator consumes it |
| 9 | Transformation | CPU preprocessing cannot keep pace with accelerator throughput |
| 10 | Locality | Network topology limits bisection bandwidth between nodes |
| 11 | Complexity | Chinchilla scaling laws govern compute-optimal training |
| 12 | Reasoning | Inference-time compute scales with reasoning chain length |
| 13 | Fidelity | Compression trades model fidelity for efficiency |
| 14 | Communication | Distributed training requires synchronization across N nodes |
| 15 | Fragility | Component failures are inevitable at scale (MTBF/N) |
| 16 | Multi-tenant | Shared clusters introduce queueing delays |
| 17 | Capital | Total cost of ownership bounds what is economically feasible |
| 18 | Sustainability | Energy consumption converts to carbon and water footprint |
| 19 | Checkpoint | Periodic state saves impose I/O burst penalties on training MFU |
| 20 | Safety | Privacy and fairness guarantees impose computational overhead |
| 21 | Sensitivity | Identifies the binding constraint via numerical partial derivatives |
| 22 | Synthesis | Inverse Roofline: derive hardware specs from an SLA target |
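
Wall 21's method can be illustrated without the solver: perturb each factor, measure the normalized response, and rank the levers. A toy finite-difference sketch; the decaying-eta model and all baseline values are assumptions for illustration, not mlsysim internals:

```python
# Sensitivity sketch (Wall 21): rank Iron Law levers by elasticity.
# Toy model: eta decays with N to mimic communication overhead.
def throughput(N, peak, mfu):
    eta = 0.95 / (1 + N / 4096)     # assumed scaling-efficiency curve
    return N * peak * mfu * eta

base = dict(N=1024, peak=989e12, mfu=0.40)
eps = 1e-3
for k, v in base.items():
    bumped = dict(base, **{k: v * (1 + eps)})
    # Normalized partial derivative: % throughput gain per 1% improvement.
    elasticity = (throughput(**bumped) / throughput(**base) - 1) / eps
    print(f"{k:5s} elasticity ~ {elasticity:.2f}")
# peak and mfu score ~1.0 but N scores ~0.8 here: at this scale, better
# kernels beat more accelerators -- the binding-constraint call Wall 21 automates.
```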
## Hardware Quick Reference
| Accelerator | Peak FP16 (TFLOPS) | HBM (GiB) | BW (TB/s) | TDP (W) |
|---|---|---|---|---|
| V100 | 125 | 32 | 0.9 | 300 |
| A100 | 312 | 80 | 2.0 | 400 |
| H100 | 989 | 80 | 3.35 | 700 |
| H200 | 989 | 141* | 4.8 | 700 |
| B200 | 2,250 | 192 | 8.0 | 1,000 |
| nRF52840 | 0.000064 | 0.001 (1 MB flash) | 0.000064 | 0.015 |
*H200 capacity listed as 141 GB in registry (non-binary).
Access via: Hardware.A100, Hardware.H100, Hardware.Tiny.nRF52840, etc.
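
A handy derived quantity from this table is the roofline ridge point, peak FLOPS divided by bandwidth: the arithmetic intensity (FLOPs/byte) a kernel needs before it stops being memory-bound. A quick sketch over the FP16 columns above:

```python
# Ridge point sketch: FLOPs/byte needed to leave the memory-bound regime,
# computed from the table above (FP16 peak / HBM bandwidth).
hw = {"V100": (125e12, 0.9e12), "A100": (312e12, 2.0e12),
      "H100": (989e12, 3.35e12), "B200": (2250e12, 8.0e12)}

for name, (peak_flops, bw) in hw.items():
    print(f"{name}: ridge ~ {peak_flops / bw:.0f} FLOPs/byte")
# V100 ~139 -> H100 ~295: compute grows faster than bandwidth, so
# low-intensity kernels (e.g. LLM decode) sit ever further below peak.
```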
## Model Quick Reference
| Model | Parameters | Architecture | Access |
|---|---|---|---|
| ResNet-50 | 25.6M | CNN | Models.ResNet50 |
| Llama-3-8B | 8.03B | Transformer | Models.Llama3_8B |
| Llama-3-70B | 70.6B | Transformer | Models.Llama3_70B |
| GPT-3 | 175B | Transformer | Models.GPT3 |
| DS-CNN (KWS) | 26K | CNN | Models.Tiny.DS_CNN |
| MobileNetV2 | 3.4M | CNN | Models.MobileNetV2 |
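
These parameter counts feed directly into the memory wall: fp16 weights cost 2 bytes each. A quick sketch using the counts from the table above:

```python
# Weight-memory sketch: fp16 footprint of each model in the table.
models = {"ResNet-50": 25.6e6, "Llama-3-8B": 8.03e9,
          "Llama-3-70B": 70.6e9, "GPT-3": 175e9}

for name, params in models.items():
    gib = params * 2 / 2**30        # 2 bytes per fp16 weight
    print(f"{name}: {gib:.1f} GiB")
# Llama-3-70B (~131.5 GiB) exceeds one H100's 80 GiB before activations
# or KV-cache are counted -- hence tensor parallelism or quantization.
```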