# mlsysim Cheat Sheet

Single-page reference for the ISCA tutorial.
## The Iron Law of ML Training

```
Time = FLOPs / (N x Peak_FLOPS x MFU x eta_scaling x Goodput)
```
| Symbol | Meaning | Typical Range |
|---|---|---|
| FLOPs | Total operations for the workload | 6PD for training (Chinchilla) |
| N | Number of accelerators | 1 to 100,000+ |
| Peak_FLOPS | Hardware peak (per device) | 312 TFLOPS (A100 FP16) - 989 TFLOPS (H100 FP16) |
| MFU | Model FLOPs Utilization | 0.30 - 0.55 |
| eta_scaling | Scaling efficiency (communication overhead) | 0.70 - 0.95 |
| Goodput | Fraction of time doing useful work (1 - failures - checkpoints) | 0.85 - 0.98 |
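
The Iron Law alone supports a back-of-the-envelope estimate before reaching for the simulator. A minimal plain-Python sketch; the workload size and cluster values are illustrative assumptions, not tutorial answers:

```python
# Iron Law sketch: Llama-3-70B-scale pretraining on 1,024 H100s (assumed).
P = 70e9                    # parameters (assumed workload)
D = 20 * P                  # Chinchilla-optimal tokens (~20 per parameter)
flops = 6 * P * D           # 6PD training FLOPs

N, peak = 1024, 989e12      # accelerators, H100 FP16 peak FLOPS
mfu, eta, goodput = 0.40, 0.85, 0.95   # typical-range values from the table

seconds = flops / (N * peak * mfu * eta * goodput)
print(f"{seconds / 86400:.0f} days")   # ~21 days at these settings
```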
## The 5 Key Equations

### 1. Roofline Bottleneck

```
T = max(FLOPs / Peak_effective, Bytes / BW_effective)
```

If compute time > memory time, you are compute-bound (increase FLOPS). If memory time > compute time, you are memory-bound (increase bandwidth).
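
A minimal sketch of the decision in plain Python, using H100-class numbers from the hardware table below; the batch-1 decode workload and 50% efficiency are illustrative assumptions:

```python
# Roofline sketch: which side of the max() binds?
def roofline_time(flops, bytes_moved, peak_flops, bw):
    t_compute = flops / peak_flops   # time if compute-bound
    t_memory = bytes_moved / bw      # time if memory-bound
    return max(t_compute, t_memory), ("compute" if t_compute > t_memory else "memory")

# Batch-1 LLM decode: ~2 FLOPs per parameter, all fp16 weights re-read per token.
P = 8e9
t, bound = roofline_time(2 * P, 2 * P, 989e12 * 0.5, 3.35e12)
print(f"{t*1e3:.2f} ms/token, {bound}-bound")  # ~4.78 ms/token, memory-bound
```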
### 2. KV-Cache Memory (PagedAttention)

```
KV_bytes = 2 x L x H_kv x D_head x S x B x bytes_per_param
```

L = layers, H_kv = KV heads, D_head = head dimension, S = sequence length, B = batch size. The factor of 2 accounts for both Key and Value tensors.
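
Plugging in published Llama-3-8B architecture constants (32 layers, 8 KV heads via GQA, head dimension 128), with an assumed batch and sequence length:

```python
# KV-cache sketch for Llama-3-8B at fp16.
L, H_kv, D_head = 32, 8, 128        # Llama-3-8B architecture constants
S, B = 4096, 32                     # assumed sequence length and batch size
bytes_per_param = 2                 # fp16

kv_bytes = 2 * L * H_kv * D_head * S * B * bytes_per_param
print(f"{kv_bytes / 2**30:.0f} GiB")  # 16 GiB -- 20% of an H100's 80 GiB
```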
### 3. Ring AllReduce Communication

```
T_allreduce = 2(N-1)/N x M/BW + 2(N-1) x alpha
```

N = workers, M = message bytes, BW = link bandwidth, alpha = per-message latency. As N grows large, the volume term approaches 2M/BW (bandwidth-optimal).
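
A quick numeric check under assumed NVLink-class link parameters; the message is one fp16 gradient copy of an assumed 8B-parameter model:

```python
# Ring AllReduce sketch: one gradient sync for 8B fp16 parameters.
N = 64                 # workers
M = 8e9 * 2            # message bytes (fp16 gradients, assumed model size)
BW = 450e9             # link bandwidth, bytes/s (assumed)
alpha = 5e-6           # per-message latency, seconds (assumed)

t = 2 * (N - 1) / N * (M / BW) + 2 * (N - 1) * alpha
print(f"{t*1e3:.1f} ms")   # ~70.6 ms; volume term is ~98% of the 2M/BW limit
```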
### 4. Chinchilla Scaling Law

```
C = 6PD           (compute-optimal training cost)
P* = sqrt(C/120)  (optimal parameter count for budget C)
```

P = parameters, D = tokens, C = total FLOPs. Training is compute-optimal when D ~ 20P (20 tokens per parameter); substituting D = 20P into C = 6PD gives C = 120P^2, hence P* = sqrt(C/120).
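
Pure arithmetic, so it is easy to verify by hand; the 1e24-FLOP budget is an assumed example:

```python
import math

# Chinchilla sketch: compute-optimal split of an assumed 1e24-FLOP budget.
C = 1e24
P_star = math.sqrt(C / 120)          # optimal parameter count
D_star = 20 * P_star                 # optimal token count
print(f"P* = {P_star/1e9:.0f}B params, D* = {D_star/1e12:.1f}T tokens")
# -> P* = 91B, D* = 1.8T; check: 6 * P_star * D_star == C (up to rounding)
```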
### 5. Carbon Footprint

```
CO2_kg = Energy_kWh x PUE x Carbon_Intensity_gCO2/kWh / 1000
```

PUE = Power Usage Effectiveness (1.0 = perfect, 1.1 = typical hyperscale). Carbon intensity varies by orders of magnitude across regions (~1 gCO2/kWh for hydro vs. ~800 gCO2/kWh for coal).
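
The same arithmetic by hand, with an assumed fleet and the endpoint intensities above:

```python
# Carbon sketch: 1,024 H100s (700 W TDP) at full power for 21 days (assumed).
energy_kwh = 1024 * 700 / 1000 * 21 * 24    # ~361 MWh of IT energy
pue = 1.1                                   # typical hyperscale

for region, ci in [("hydro", 1), ("coal", 800)]:   # gCO2/kWh endpoints
    co2_kg = energy_kwh * pue * ci / 1000
    print(f"{region}: {co2_kg/1000:.1f} tCO2")
# hydro: 0.4 tCO2, coal: 317.9 tCO2 -- siting dominates the footprint.
```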
## Efficiency Parameter Guide

| Parameter / Workload | Description | Low | Typical | High |
|---|---|---|---|---|
| `efficiency` | MFU (fraction of peak FLOPS achieved) | 0.10 | 0.30-0.50 | 0.65 |
| `mfu` | Same as `efficiency`, used in fleet solvers | 0.10 | 0.35 | 0.55 |
| Batch=1 inference | LLM decode (memory-bound) | 0.01 | 0.05 | 0.15 |
| Batched inference | LLM prefill / CNN inference | 0.20 | 0.40 | 0.60 |
| Training (single node) | Typical training loop | 0.20 | 0.40 | 0.55 |
| Training (distributed) | Large cluster with comms | 0.15 | 0.30 | 0.45 |
| FlashAttention | Fused attention kernel | 0.50 | 0.60 | 0.70 |
| TinyML (MCU) | Microcontroller inference | 0.05 | 0.15 | 0.30 |
## Quick API Reference

### 1. Single-Node Roofline

```python
from mlsysim import Engine, Hardware, Models

profile = Engine.solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    batch_size=1,
    precision="fp16",    # "fp32", "fp16", "int8", "int4", "fp8"
    efficiency=0.5,
    is_training=False,   # True for training memory/FLOPs
)
# Returns: PerformanceProfile with .latency, .throughput, .bottleneck,
# .memory_footprint, .mfu, .energy, .feasible
```
### 2. LLM Serving (Prefill + Decode)

```python
from mlsysim import ServingModel, Hardware, Models

result = ServingModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    seq_len=4096,
    batch_size=32,
    precision="fp16",
)
# Returns: ServingResult with .ttft, .itl, .kv_cache_size,
# .total_memory_required, .feasible
```
### 3. Distributed Training (3D Parallelism)

```python
from mlsysim import DistributedModel, Models, Systems

result = DistributedModel().solve(
    model=Models.Llama3_70B,
    fleet=Systems.Clusters.Research_256,
    batch_size=1024,
    tp_size=8, pp_size=4,
    precision="fp16",
    efficiency=0.4,
    overlap_comm=True,
)
# Returns: DistributedResult with .scaling_efficiency,
# .step_latency_total, .dp_communication_latency,
# .bubble_fraction, .effective_throughput, .parallelism
```
### 4. Compression (Quantization / Pruning)

```python
from mlsysim import CompressionModel, Hardware, Models

result = CompressionModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    method="quantization",   # "quantization", "pruning", "distillation"
    target_bitwidth=4,       # 4, 8, 16
)
# Returns: CompressionResult with .compression_ratio, .compressed_size_gb,
# .memory_savings_pct, .inference_speedup, .estimated_accuracy_delta
```
### 5. Sustainability and Economics

```python
from mlsysim import SustainabilityModel, EconomicsModel, Systems, Infra

fleet = Systems.Clusters.Research_256

co2 = SustainabilityModel().solve(
    fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4,
)
# Returns: SustainabilityResult with .total_energy_kwh, .carbon_footprint_kg,
# .water_usage_liters, .pue

tco = EconomicsModel().solve(fleet, duration_days=365, mfu=0.4)
# Returns: EconomicsResult with .tco_usd, .capex_usd, .total_opex_usd,
# .opex_energy_usd, .carbon_footprint_kg
```
## The 22 Walls at a Glance
| # | Wall | One-Liner |
|---|---|---|
| 1 | Compute | Peak FLOPS ceiling of a single accelerator |
| 2 | Memory | HBM capacity and bandwidth ceilings |
| 3 | Software | Gap between peak and achieved FLOPS (MFU) |
| 4 | Serving | LLM inference: compute-bound prefill vs. memory-bound decode |
| 5 | Batching | Static batching wastes memory through KV-cache fragmentation |
| 6 | Streaming | Wafer-scale shifts bottleneck from HBM to injection interconnect |
| 7 | Tail Latency | P99 latency grows non-linearly as utilization approaches 1.0 |
| 8 | Ingestion | Storage I/O must supply data at the rate the accelerator consumes it |
| 9 | Transformation | CPU preprocessing cannot keep pace with accelerator throughput |
| 10 | Locality | Network topology limits bisection bandwidth between nodes |
| 11 | Complexity | Chinchilla scaling laws govern compute-optimal training |
| 12 | Reasoning | Inference-time compute scales with reasoning chain length |
| 13 | Fidelity | Compression trades model fidelity for efficiency |
| 14 | Communication | Distributed training requires synchronization across N nodes |
| 15 | Fragility | Component failures are inevitable at scale (MTBF/N) |
| 16 | Multi-tenant | Shared clusters introduce queueing delays |
| 17 | Capital | Total cost of ownership bounds what is economically feasible |
| 18 | Sustainability | Energy consumption converts to carbon and water footprint |
| 19 | Checkpoint | Periodic state saves impose I/O burst penalties on training MFU |
| 20 | Safety | Privacy and fairness guarantees impose computational overhead |
| 21 | Sensitivity | Identifies the binding constraint via numerical partial derivatives |
| 22 | Synthesis | Inverse Roofline: derive hardware specs from an SLA target |
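
Wall 21's method can be illustrated without the solver: perturb each factor, measure the normalized response, and rank the levers. A toy finite-difference sketch; the decaying-eta model and all baseline values are assumptions for illustration, not mlsysim internals:

```python
# Sensitivity sketch (Wall 21): rank Iron Law levers by elasticity.
# Toy model: eta decays with N to mimic communication overhead.
def throughput(N, peak, mfu):
    eta = 0.95 / (1 + N / 4096)     # assumed scaling-efficiency curve
    return N * peak * mfu * eta

base = dict(N=1024, peak=989e12, mfu=0.40)
eps = 1e-3
for k, v in base.items():
    bumped = dict(base, **{k: v * (1 + eps)})
    # Normalized partial derivative: % throughput gain per 1% improvement.
    elasticity = (throughput(**bumped) / throughput(**base) - 1) / eps
    print(f"{k:5s} elasticity ~ {elasticity:.2f}")
# peak and mfu score ~1.0 but N scores ~0.8 here: at this scale, better
# kernels beat more accelerators -- the binding-constraint call Wall 21 automates.
```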
## Hardware Quick Reference
| Accelerator | Peak FP16 (TFLOPS) | HBM (GiB) | BW (TB/s) | TDP (W) |
|---|---|---|---|---|
| V100 | 125 | 32 | 0.9 | 300 |
| A100 | 312 | 80 | 2.0 | 400 |
| H100 | 989 | 80 | 3.35 | 700 |
| H200 | 989 | 141* | 4.8 | 700 |
| B200 | 2,250 | 192 | 8.0 | 1,000 |
| nRF52840 | 0.000064 | 0.001 (1 MB flash) | 0.000064 | 0.015 |
*H200 capacity listed as 141 GB in registry (non-binary).
Access via: Hardware.A100, Hardware.H100, Hardware.Tiny.nRF52840, etc.
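
A handy derived quantity from this table is the roofline ridge point, peak FLOPS divided by bandwidth: the arithmetic intensity (FLOPs/byte) a kernel needs before it stops being memory-bound. A quick sketch over the FP16 columns above:

```python
# Ridge point sketch: FLOPs/byte needed to leave the memory-bound regime,
# computed from the table above (FP16 peak / HBM bandwidth).
hw = {"V100": (125e12, 0.9e12), "A100": (312e12, 2.0e12),
      "H100": (989e12, 3.35e12), "B200": (2250e12, 8.0e12)}

for name, (peak_flops, bw) in hw.items():
    print(f"{name}: ridge ~ {peak_flops / bw:.0f} FLOPs/byte")
# V100 ~139 -> H100 ~295: compute grows faster than bandwidth, so
# low-intensity kernels (e.g. LLM decode) sit ever further below peak.
```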
## Model Quick Reference
| Model | Parameters | Architecture | Access |
|---|---|---|---|
| ResNet-50 | 25.6M | CNN | Models.ResNet50 |
| Llama-3-8B | 8.03B | Transformer | Models.Llama3_8B |
| Llama-3-70B | 70.6B | Transformer | Models.Llama3_70B |
| GPT-3 | 175B | Transformer | Models.GPT3 |
| DS-CNN (KWS) | 26K | CNN | Models.Tiny.DS_CNN |
| MobileNetV2 | 3.4M | CNN | Models.MobileNetV2 |
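
These parameter counts feed directly into the memory wall: fp16 weights cost 2 bytes each. A quick sketch using the counts from the table above:

```python
# Weight-memory sketch: fp16 footprint of each model in the table.
models = {"ResNet-50": 25.6e6, "Llama-3-8B": 8.03e9,
          "Llama-3-70B": 70.6e9, "GPT-3": 175e9}

for name, params in models.items():
    gib = params * 2 / 2**30        # 2 bytes per fp16 weight
    print(f"{name}: {gib:.1f} GiB")
# Llama-3-70B (~131.5 GiB) exceeds one H100's 80 GiB before activations
# or KV-cache are counted -- hence tensor parallelism or quantization.
```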