# mlsysim Cheat Sheet

Single-page reference for the ISCA tutorial.

---

## The Iron Law of ML Training

```
Time = FLOPs / (N x Peak_FLOPS x MFU x eta_scaling x Goodput)
```

| Symbol | Meaning | Typical Range |
|--------|---------|---------------|
| FLOPs | Total operations for the workload | 6PD for training (Chinchilla) |
| N | Number of accelerators | 1 to 100,000+ |
| Peak_FLOPS | Hardware peak (per device) | 989 TFLOPS (H100 FP16) |
| MFU | Model FLOPs Utilization | 0.30 - 0.55 |
| eta_scaling | Scaling efficiency (communication overhead) | 0.70 - 0.95 |
| Goodput | Fraction of time doing useful work (1 - failures - checkpoints) | 0.85 - 0.98 |
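
Worked example (plain Python; the workload size and efficiency numbers are assumptions for illustration, not mlsysim registry output):

```python
# Iron Law estimate with illustrative values.
P = 70e9                 # parameters (assumed)
D = 20 * P               # Chinchilla-optimal tokens (~20 per parameter)
flops = 6 * P * D        # total training FLOPs = 6PD

N = 256                  # accelerators (assumed)
peak = 989e12            # H100 FP16 peak FLOPS per device
mfu = 0.40               # Model FLOPs Utilization
eta_scaling = 0.85       # scaling efficiency
goodput = 0.95           # fraction of wall-clock doing useful work

time_s = flops / (N * peak * mfu * eta_scaling * goodput)
print(f"Estimated training time: {time_s / 86400:.0f} days")  # ~83 days
```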

---

## The 5 Key Equations

### 1. Roofline Bottleneck

```
T = max(FLOPs / Peak_effective, Bytes / BW_effective)
```

If compute time > memory time, you are **compute-bound** (increase FLOPS).
If memory time > compute time, you are **memory-bound** (increase bandwidth).
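
A minimal numeric sketch, assuming H100-class effective rates and an illustrative kernel:

```python
# Roofline check: which term dominates? (assumed workload, H100-class specs)
peak_eff = 989e12 * 0.5        # Peak_effective: peak FP16 x assumed efficiency
bw_eff = 3.35e12 * 0.8         # BW_effective: peak HBM BW x assumed efficiency

flops = 2e12                   # FLOPs the kernel executes (assumed)
bytes_moved = 16e9             # bytes moved to/from HBM (assumed)

t_compute = flops / peak_eff         # ~4.0 ms
t_memory = bytes_moved / bw_eff      # ~6.0 ms
print("memory-bound" if t_memory > t_compute else "compute-bound",
      f"T = {max(t_compute, t_memory) * 1e3:.1f} ms")
```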

### 2. KV-Cache Memory (PagedAttention)

```
KV_bytes = 2 x L x H_kv x D_head x S x B x bytes_per_param
```

L = layers, H_kv = KV heads, D_head = head dimension, S = sequence length, B = batch size.
Factor of 2 accounts for both Key and Value tensors.
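
A quick check with assumed Llama-3-8B-style attention dimensions (32 layers, 8 KV heads, head dim 128):

```python
# KV-cache size for assumed Llama-3-8B-like dimensions at fp16.
L, H_kv, D_head = 32, 8, 128     # layers, KV heads, head dim (assumed)
S, B = 4096, 32                  # sequence length, batch size
bytes_per_param = 2              # fp16

kv_bytes = 2 * L * H_kv * D_head * S * B * bytes_per_param
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")  # 16 GiB -- a fifth of an 80 GiB H100
```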

### 3. Ring AllReduce Communication

```
T_allreduce = 2(N-1)/N x M/BW + 2(N-1) x alpha
```

N = workers, M = message bytes, BW = link bandwidth, alpha = per-message latency.
As N grows large, the volume term approaches 2M/BW (bandwidth-optimal).
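
A small sketch with an assumed gradient size, link bandwidth, and per-message latency makes the asymptote visible:

```python
# Ring AllReduce time under assumed message size, bandwidth, and latency.
def allreduce_time(n, msg_bytes, bw, alpha):
    volume = 2 * (n - 1) / n * msg_bytes / bw   # bandwidth term -> 2M/BW as n grows
    latency = 2 * (n - 1) * alpha               # per-message latency term
    return volume + latency

M = 16e9           # 8B fp16 gradients ~ 16 GB (assumed)
BW = 50e9          # 400 Gb/s link ~ 50 GB/s (assumed)
alpha = 5e-6       # 5 us per message (assumed)

for n in (8, 64, 512):
    print(f"N={n:3d}: {allreduce_time(n, M, BW, alpha) * 1e3:.0f} ms")  # 560, 631, 644 ms
```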

### 4. Chinchilla Scaling Law

```
C = 6PD (compute-optimal training cost)
P* = sqrt(C/120) (optimal parameter count for budget C)
```

P = parameters, D = tokens, C = total FLOPs. Training is compute-optimal when
D ~ 20P (20 tokens per parameter).
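
Worked example for an assumed budget of 1e24 FLOPs:

```python
import math

# Compute-optimal split for an assumed budget C = 1e24 FLOPs.
C = 1e24
P_opt = math.sqrt(C / 120)   # C = 6PD with D = 20P  =>  C = 120 P^2
D_opt = 20 * P_opt

print(f"P* ~ {P_opt / 1e9:.0f}B parameters, D ~ {D_opt / 1e12:.1f}T tokens")
# -> ~91B parameters trained on ~1.8T tokens; 6 * P_opt * D_opt recovers C.
```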

### 5. Carbon Footprint

```
CO2_kg = Energy_kWh x PUE x Carbon_Intensity_gCO2_per_kWh / 1000
```

PUE = Power Usage Effectiveness (1.0 = perfect, 1.1 = typical hyperscale).
Carbon intensity varies by more than 100x across regions (~1 gCO2/kWh hydro vs. ~800 gCO2/kWh coal).
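
Worked example, assuming a 500 MWh training run:

```python
# Carbon footprint of an assumed 500 MWh (500,000 kWh) run in two regions.
energy_kwh = 500_000
pue = 1.1   # typical hyperscale

for region, gco2_per_kwh in [("hydro-heavy grid", 1), ("coal-heavy grid", 800)]:
    co2_kg = energy_kwh * pue * gco2_per_kwh / 1000
    print(f"{region}: {co2_kg / 1000:.1f} tCO2")   # ~0.6 t vs. 440.0 t
```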

---

## Efficiency Parameter Guide

| Parameter | Description | Low | Typical | High |
|-----------|-------------|-----|---------|------|
| `efficiency` | MFU (fraction of peak FLOPS achieved) | 0.10 | 0.30-0.50 | 0.65 |
| `mfu` | Same as efficiency, used in fleet solvers | 0.10 | 0.35 | 0.55 |
| Batch=1 inference | LLM decode (memory-bound) | 0.01 | 0.05 | 0.15 |
| Batched inference | LLM prefill / CNN inference | 0.20 | 0.40 | 0.60 |
| Training (single node) | Typical training loop | 0.20 | 0.40 | 0.55 |
| Training (distributed) | Large cluster with comms | 0.15 | 0.30 | 0.45 |
| FlashAttention | Fused attention kernel | 0.50 | 0.60 | 0.70 |
| TinyML (MCU) | Microcontroller inference | 0.05 | 0.15 | 0.30 |

---

## Quick API Reference

### 1. Single-Node Roofline

```python
from mlsysim import Engine, Hardware, Models

profile = Engine.solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    batch_size=1,
    precision="fp16",      # "fp32", "fp16", "int8", "int4", "fp8"
    efficiency=0.5,
    is_training=False,     # True for training memory/FLOPs
)
# Returns: PerformanceProfile with .latency, .throughput, .bottleneck,
# .memory_footprint, .mfu, .energy, .feasible
```

### 2. LLM Serving (Prefill + Decode)

```python
from mlsysim import ServingModel, Hardware, Models

result = ServingModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    seq_len=4096,
    batch_size=32,
    precision="fp16",
)
# Returns: ServingResult with .ttft, .itl, .kv_cache_size,
# .total_memory_required, .feasible
```

### 3. Distributed Training (3D Parallelism)

```python
from mlsysim import DistributedModel, Models, Systems

result = DistributedModel().solve(
    model=Models.Llama3_70B,
    fleet=Systems.Clusters.Research_256,
    batch_size=1024,
    tp_size=8, pp_size=4,
    precision="fp16",
    efficiency=0.4,
    overlap_comm=True,
)
# Returns: DistributedResult with .scaling_efficiency,
# .step_latency_total, .dp_communication_latency,
# .bubble_fraction, .effective_throughput, .parallelism
```

### 4. Compression (Quantization / Pruning)

```python
from mlsysim import CompressionModel, Hardware, Models

result = CompressionModel().solve(
    model=Models.Llama3_8B,
    hardware=Hardware.H100,
    method="quantization",   # "quantization", "pruning", "distillation"
    target_bitwidth=4,       # 4, 8, 16
)
# Returns: CompressionResult with .compression_ratio, .compressed_size_gb,
# .memory_savings_pct, .inference_speedup, .estimated_accuracy_delta
```

### 5. Sustainability and Economics

```python
from mlsysim import SustainabilityModel, EconomicsModel, Systems, Infra

fleet = Systems.Clusters.Research_256

co2 = SustainabilityModel().solve(fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4)
# Returns: SustainabilityResult with .total_energy_kwh, .carbon_footprint_kg,
# .water_usage_liters, .pue

tco = EconomicsModel().solve(fleet, duration_days=365, mfu=0.4)
# Returns: EconomicsResult with .tco_usd, .capex_usd, .total_opex_usd,
# .opex_energy_usd, .carbon_footprint_kg
```

---

## The 22 Walls at a Glance

| # | Wall | One-Liner |
|---|------|-----------|
| 1 | Compute | Peak FLOPS ceiling of a single accelerator |
| 2 | Memory | HBM capacity and bandwidth ceilings |
| 3 | Software | Gap between peak and achieved FLOPS (MFU) |
| 4 | Serving | LLM inference: compute-bound prefill vs. memory-bound decode |
| 5 | Batching | Static batching wastes memory through KV-cache fragmentation |
| 6 | Streaming | Wafer-scale shifts bottleneck from HBM to injection interconnect |
| 7 | Tail Latency | P99 latency grows non-linearly as utilization approaches 1.0 |
| 8 | Ingestion | Storage I/O must supply data at the rate the accelerator consumes it |
| 9 | Transformation | CPU preprocessing cannot keep pace with accelerator throughput |
| 10 | Locality | Network topology limits bisection bandwidth between nodes |
| 11 | Complexity | Chinchilla scaling laws govern compute-optimal training |
| 12 | Reasoning | Inference-time compute scales with reasoning chain length |
| 13 | Fidelity | Compression trades model fidelity for efficiency |
| 14 | Communication | Distributed training requires synchronization across N nodes |
| 15 | Fragility | Component failures are inevitable at scale (MTBF/N) |
| 16 | Multi-tenant | Shared clusters introduce queueing delays |
| 17 | Capital | Total cost of ownership bounds what is economically feasible |
| 18 | Sustainability | Energy consumption converts to carbon and water footprint |
| 19 | Checkpoint | Periodic state saves impose I/O burst penalties on training MFU |
| 20 | Safety | Privacy and fairness guarantees impose computational overhead |
| 21 | Sensitivity | Identifies the binding constraint via numerical partial derivatives |
| 22 | Synthesis | Inverse Roofline: derive hardware specs from an SLA target |

---

## Hardware Quick Reference

| Accelerator | Peak FP16 (TFLOPS) | HBM (GiB) | BW (TB/s) | TDP (W) |
|-------------|-------------------|------------|-----------|---------|
| V100 | 125 | 32 | 0.9 | 300 |
| A100 | 312 | 80 | 2.0 | 400 |
| H100 | 989 | 80 | 3.35 | 700 |
| H200 | 989 | 141* | 4.8 | 700 |
| B200 | 2,250 | 192 | 8.0 | 1,000 |
| nRF52840 | 0.000064 | 0.001 (1 MB flash) | 0.000064 | 0.015 |

*H200 capacity listed as 141 GB in registry (non-binary).

Access via: `Hardware.A100`, `Hardware.H100`, `Hardware.Tiny.nRF52840`, etc.

---

## Model Quick Reference

| Model | Parameters | Architecture | Access |
|-------|-----------|--------------|--------|
| ResNet-50 | 25.6M | CNN | `Models.ResNet50` |
| Llama-3-8B | 8.03B | Transformer | `Models.Llama3_8B` |
| Llama-3-70B | 70.6B | Transformer | `Models.Llama3_70B` |
| GPT-3 | 175B | Transformer | `Models.GPT3` |
| DS-CNN (KWS) | 26K | CNN | `Models.Tiny.DS_CNN` |
| MobileNetV2 | 3.4M | CNN | `Models.MobileNetV2` |
|