# mlsysim Cheat Sheet

Single-page reference for the ISCA tutorial.


## The Iron Law of ML Training

```
Time = FLOPs / (N x Peak_FLOPS x MFU x eta_scaling x Goodput)
```

| Symbol | Meaning | Typical Range |
|---|---|---|
| FLOPs | Total operations for the workload | 6PD for training (Chinchilla) |
| N | Number of accelerators | 1 to 100,000+ |
| Peak_FLOPS | Hardware peak (per device) | 989 TFLOPS (H100 FP16) |
| MFU | Model FLOPs Utilization | 0.30 - 0.55 |
| eta_scaling | Scaling efficiency (communication overhead) | 0.70 - 0.95 |
| Goodput | Fraction of time doing useful work (1 - failures - checkpoints) | 0.85 - 0.98 |
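
A back-of-envelope check of the Iron Law in plain Python; the model size, cluster size, and efficiency numbers below are illustrative assumptions, not mlsysim outputs:

```python
# Iron Law: Time = FLOPs / (N * Peak_FLOPS * MFU * eta_scaling * Goodput)
P = 70e9                     # parameters (Llama-3-70B scale, assumed)
D = 20 * P                   # Chinchilla-optimal tokens (~20 tokens/param)
flops = 6 * P * D            # total training FLOPs (C = 6PD)

N = 256                      # accelerators
peak = 989e12                # H100 FP16 peak, FLOP/s
mfu = 0.40                   # achieved fraction of peak
eta_scaling = 0.85           # scaling efficiency
goodput = 0.95               # useful fraction of wall-clock time

seconds = flops / (N * peak * mfu * eta_scaling * goodput)
print(f"Training time ~ {seconds / 86400:.0f} days")   # ~83 days with these assumptions
```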

## The 5 Key Equations

### 1. Roofline Bottleneck

```
T = max(FLOPs / Peak_effective, Bytes / BW_effective)
```

If compute time > memory time, you are compute-bound (increase FLOPS). If memory time > compute time, you are memory-bound (increase bandwidth).
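
A minimal sketch of the same comparison in plain Python, using assumed numbers for batch-1 decode of an 8B-parameter model on an H100-class device:

```python
# Roofline: the larger of the two terms is the bottleneck.
flops = 2 * 8e9              # ~2 FLOPs per parameter per generated token (assumed)
bytes_moved = 8e9 * 2        # read every fp16 weight once per token (2 bytes/param)

peak_flops = 989e12 * 0.5    # effective peak = hardware peak * efficiency
peak_bw = 3.35e12            # HBM bandwidth, bytes/s

t_compute = flops / peak_flops
t_memory = bytes_moved / peak_bw
print("memory-bound" if t_memory > t_compute else "compute-bound")   # memory-bound here
```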

### 2. KV-Cache Memory (PagedAttention)

```
KV_bytes = 2 x L x H_kv x D_head x S x B x bytes_per_param
```

L = layers, H_kv = KV heads, D_head = head dimension, S = sequence length, B = batch size. Factor of 2 accounts for both Key and Value tensors.
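
Plugging in Llama-3-8B-like dimensions (32 layers, 8 KV heads via GQA, head dim 128; these values are assumptions here, check the model registry for exact numbers):

```python
# KV_bytes = 2 * L * H_kv * D_head * S * B * bytes_per_param
L, H_kv, D_head = 32, 8, 128     # layers, KV heads, head dimension (assumed)
S, B = 4096, 32                  # sequence length, batch size
bytes_per_param = 2              # fp16

kv_bytes = 2 * L * H_kv * D_head * S * B * bytes_per_param
print(f"KV cache ~ {kv_bytes / 2**30:.0f} GiB")   # 16 GiB
```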

### 3. Ring AllReduce Communication

```
T_allreduce = 2(N-1)/N x M/BW + 2(N-1) x alpha
```

N = workers, M = message bytes, BW = link bandwidth, alpha = per-message latency. As N grows large, volume term approaches 2M/BW (bandwidth-optimal).
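
A quick evaluation of the formula with assumed numbers (fp16 gradients for a 70B-parameter model, a 400 Gb/s link, 5 us per-message latency); real systems shard and overlap this communication, so treat the result as an upper bound:

```python
# T_allreduce = 2(N-1)/N * M/BW + 2(N-1) * alpha
N = 256                      # workers
M = 70e9 * 2                 # gradient bytes (70B params, fp16)
BW = 400e9 / 8               # 400 Gb/s link, in bytes/s
alpha = 5e-6                 # per-message latency, seconds

t = 2 * (N - 1) / N * M / BW + 2 * (N - 1) * alpha
print(f"AllReduce ~ {t:.1f} s per unsharded, non-overlapped step")   # ~5.6 s
```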

### 4. Chinchilla Scaling Law

```
C = 6PD           (compute-optimal training cost)
P* = sqrt(C/120)  (optimal parameter count for budget C)
```

P = parameters, D = tokens, C = total FLOPs. Training is compute-optimal when D ~ 20P (20 tokens per parameter).
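
For a fixed FLOP budget, the compute-optimal split follows directly (the budget value below is illustrative):

```python
# Compute-optimal split of a fixed budget: C = 6PD with D ~ 20P
C = 1e24                     # total training FLOPs budget (assumed)
P_opt = (C / 120) ** 0.5     # P* = sqrt(C/120)
D_opt = 20 * P_opt           # D ~ 20P
print(f"P* ~ {P_opt / 1e9:.0f}B params, D ~ {D_opt / 1e12:.1f}T tokens")   # ~91B params, ~1.8T tokens
```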

### 5. Carbon Footprint

```
CO2_kg = Energy_kWh x PUE x Carbon_Intensity_gCO2/kWh / 1000
```

PUE = Power Usage Effectiveness (1.0 = perfect, 1.1 = typical hyperscale). Carbon intensity varies by orders of magnitude across regions (~1 gCO2/kWh for hydro vs. ~800 gCO2/kWh for coal).
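
The same formula evaluated for a month of fleet energy under a few assumed grid intensities (cluster size, PUE, and intensity values are illustrative, not registry values):

```python
# CO2_kg = Energy_kWh * PUE * Carbon_Intensity_gCO2_per_kWh / 1000
energy_kwh = 256 * 700 / 1000 * 24 * 30            # 256 devices at 700 W for 30 days
pue = 1.1
intensity = {"hydro": 1, "gas": 400, "coal": 800}  # gCO2/kWh (assumed)

for grid, g in intensity.items():
    print(f"{grid:>5}: {energy_kwh * pue * g / 1000:>10,.0f} kg CO2")
```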


## Efficiency Parameter Guide

| Parameter / Scenario | Description | Low | Typical | High |
|---|---|---|---|---|
| efficiency | MFU (fraction of peak FLOPS achieved) | 0.10 | 0.30-0.50 | 0.65 |
| mfu | Same as efficiency, used in fleet solvers | 0.10 | 0.35 | 0.55 |
| Batch=1 inference | LLM decode (memory-bound) | 0.01 | 0.05 | 0.15 |
| Batched inference | LLM prefill / CNN inference | 0.20 | 0.40 | 0.60 |
| Training (single node) | Typical training loop | 0.20 | 0.40 | 0.55 |
| Training (distributed) | Large cluster with comms | 0.15 | 0.30 | 0.45 |
| FlashAttention | Fused attention kernel | 0.50 | 0.60 | 0.70 |
| TinyML (MCU) | Microcontroller inference | 0.05 | 0.15 | 0.30 |

## Quick API Reference

### 1. Single-Node Roofline

```python
from mlsysim import Engine, Hardware, Models

profile = Engine.solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    batch_size=1,
    precision="fp16",       # "fp32", "fp16", "int8", "int4", "fp8"
    efficiency=0.5,
    is_training=False,      # True for training memory/FLOPs
)
# Returns: PerformanceProfile with .latency, .throughput, .bottleneck,
#          .memory_footprint, .mfu, .energy, .feasible
```

### 2. LLM Serving (Prefill + Decode)

```python
from mlsysim import ServingModel, Hardware, Models

result = ServingModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    seq_len=4096,
    batch_size=32,
    precision="fp16",
)
# Returns: ServingResult with .ttft, .itl, .kv_cache_size,
#          .total_memory_required, .feasible
```

### 3. Distributed Training (3D Parallelism)

```python
from mlsysim import DistributedModel, Models, Systems

result = DistributedModel().solve(
    model=Models.Llama3_70B,
    fleet=Systems.Clusters.Research_256,
    batch_size=1024,
    tp_size=8, pp_size=4,
    precision="fp16",
    efficiency=0.4,
    overlap_comm=True,
)
# Returns: DistributedResult with .scaling_efficiency,
#          .step_latency_total, .dp_communication_latency,
#          .bubble_fraction, .effective_throughput, .parallelism
```

### 4. Compression (Quantization / Pruning)

```python
from mlsysim import CompressionModel, Hardware, Models

result = CompressionModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    method="quantization",  # "quantization", "pruning", "distillation"
    target_bitwidth=4,      # 4, 8, 16
)
# Returns: CompressionResult with .compression_ratio, .compressed_size_gb,
#          .memory_savings_pct, .inference_speedup, .estimated_accuracy_delta
```

### 5. Sustainability and Economics

```python
from mlsysim import SustainabilityModel, EconomicsModel, Systems, Infra

fleet = Systems.Clusters.Research_256

co2 = SustainabilityModel().solve(fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4)
# Returns: SustainabilityResult with .total_energy_kwh, .carbon_footprint_kg,
#          .water_usage_liters, .pue

tco = EconomicsModel().solve(fleet, duration_days=365, mfu=0.4)
# Returns: EconomicsResult with .tco_usd, .capex_usd, .total_opex_usd,
#          .opex_energy_usd, .carbon_footprint_kg
```

## The 22 Walls at a Glance

| # | Wall | One-Liner |
|---|---|---|
| 1 | Compute | Peak FLOPS ceiling of a single accelerator |
| 2 | Memory | HBM capacity and bandwidth ceilings |
| 3 | Software | Gap between peak and achieved FLOPS (MFU) |
| 4 | Serving | LLM inference: compute-bound prefill vs. memory-bound decode |
| 5 | Batching | Static batching wastes memory through KV-cache fragmentation |
| 6 | Streaming | Wafer-scale shifts bottleneck from HBM to injection interconnect |
| 7 | Tail Latency | P99 latency grows non-linearly as utilization approaches 1.0 |
| 8 | Ingestion | Storage I/O must supply data at the rate the accelerator consumes it |
| 9 | Transformation | CPU preprocessing cannot keep pace with accelerator throughput |
| 10 | Locality | Network topology limits bisection bandwidth between nodes |
| 11 | Complexity | Chinchilla scaling laws govern compute-optimal training |
| 12 | Reasoning | Inference-time compute scales with reasoning chain length |
| 13 | Fidelity | Compression trades model fidelity for efficiency |
| 14 | Communication | Distributed training requires synchronization across N nodes |
| 15 | Fragility | Component failures are inevitable at scale (MTBF/N) |
| 16 | Multi-tenant | Shared clusters introduce queueing delays |
| 17 | Capital | Total cost of ownership bounds what is economically feasible |
| 18 | Sustainability | Energy consumption converts to carbon and water footprint |
| 19 | Checkpoint | Periodic state saves impose I/O burst penalties on training MFU |
| 20 | Safety | Privacy and fairness guarantees impose computational overhead |
| 21 | Sensitivity | Identifies the binding constraint via numerical partial derivatives |
| 22 | Synthesis | Inverse Roofline: derive hardware specs from an SLA target |

## Hardware Quick Reference

| Accelerator | Peak FP16 (TFLOPS) | HBM (GiB) | BW (TB/s) | TDP (W) |
|---|---|---|---|---|
| V100 | 125 | 32 | 0.9 | 300 |
| A100 | 312 | 80 | 2.0 | 400 |
| H100 | 989 | 80 | 3.35 | 700 |
| H200 | 989 | 141* | 4.8 | 700 |
| B200 | 2,250 | 192 | 8.0 | 1,000 |
| nRF52840 | 0.000064 | 0.001 (1 MB flash) | 0.000064 | 0.015 |

*H200 capacity listed as 141 GB in registry (non-binary).

Access via: `Hardware.A100`, `Hardware.H100`, `Hardware.Tiny.nRF52840`, etc.


## Model Quick Reference

| Model | Parameters | Architecture | Access |
|---|---|---|---|
| ResNet-50 | 25.6M | CNN | `Models.ResNet50` |
| Llama-3-8B | 8.03B | Transformer | `Models.Llama3_8B` |
| Llama-3-70B | 70.6B | Transformer | `Models.Llama3_70B` |
| GPT-3 | 175B | Transformer | `Models.GPT3` |
| DS-CNN (KWS) | 26K | CNN | `Models.Tiny.DS_CNN` |
| MobileNetV2 | 3.4M | CNN | `Models.MobileNetV2` |