MLSys·im Tutorial: Hands-On Exercises
Eight exercises aligned with the eight tutorial parts. Each exercise is self-contained and takes 5-10 minutes.
Exercise 1 — The Roofline Transition (Part 1: Single-Node Performance)
Learning Objective: Identify the batch size where a CNN workload transitions from memory-bound to compute-bound on a datacenter GPU.
Setup
import mlsysim
from mlsysim import Engine, Hardware, Models
model = Models.ResNet50
hw = Hardware.A100
Task
Sweep batch_size from 1 to 512 (powers of 2). For each, call
Engine.solve() and record the bottleneck field along with the
compute and memory latency components. Find the smallest batch size
where the compute term overtakes the memory term.
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    p = Engine.solve(model, hw, batch_size=bs, precision="fp16", efficiency=0.5)
    print(f"bs={bs:>4d} bottleneck={p.bottleneck:<10s} "
          f"T_compute={p.latency_compute.to('ms'):~P.3f} "
          f"T_memory={p.latency_memory.to('ms'):~P.3f} "
          f"throughput={p.throughput:~P.1f}")
Question
At what batch size does ResNet-50's compute time overtake its memory time
on the A100? Compare latency_compute vs. latency_memory to find the
exact crossover point.
Hint
Watch for when latency_compute > latency_memory. The crossover is the
Roofline ridge point. Note that the reported bottleneck field depends
on the overall arithmetic intensity calculation, so inspect both latency
components directly.
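As a back-of-envelope check, independent of the simulator, you can estimate the ridge point from published A100 numbers. The figures below (roughly 312 TFLOPS dense FP16 and ~2 TB/s HBM bandwidth) are datasheet approximations, not values read from the mlsysim registry:
# Rough ridge-point sketch -- datasheet numbers, not mlsysim registry values
peak_flops = 312e12 * 0.5    # A100 FP16 tensor peak, scaled by efficiency=0.5
mem_bw = 2.0e12              # bytes/s, approximate A100 80GB HBM bandwidth
ridge = peak_flops / mem_bw  # FLOPs/byte where compute time equals memory time
flops_per_image = 8e9        # ~8 GFLOPs per ResNet-50 forward pass
weight_bytes = 51e6          # ~51 MB of FP16 weights
ai_bs1 = flops_per_image / weight_bytes  # arithmetic intensity at batch_size=1, ignoring activations
print(f"ridge ~{ridge:.0f} FLOPs/byte, ResNet-50 AI at bs=1 ~{ai_bs1:.0f} FLOPs/byte")
If the arithmetic intensity already exceeds the ridge at batch_size=1, the workload is compute-bound from the start, which is what the sweep above should confirm.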
Expected Answer
ResNet-50 on the A100 at FP16 with efficiency=0.5 is already compute-bound
at very small batch sizes because of its high FLOP count (~8 GFLOPs per
image) relative to its small weight footprint (~51 MB at FP16). The compute
term scales linearly with batch size while the memory term grows more slowly
(weights loaded once, only activation traffic scales). For CNN workloads
like ResNet-50, the transition may already be at batch_size=1, unlike
Transformer decode which is memory-bound at batch_size=1. Compare with
Models.Llama3_8B to see a memory-bound regime.
Discussion
Why does this transition point matter for production deployment? Compare ResNet-50 (compute-bound even at batch_size=1) with Llama-3-8B (memory-bound at batch_size=1). What hardware characteristic should you optimize for in each case -- peak FLOPS or memory bandwidth?
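For the comparison, a quick sketch that reuses the same Engine.solve call as above (exact latencies and bottleneck labels depend on the registry entries):
for name, m in [("ResNet-50", Models.ResNet50), ("Llama-3-8B", Models.Llama3_8B)]:
    p = Engine.solve(m, hw, batch_size=1, precision="fp16", efficiency=0.5)
    print(f"{name:<11s} bottleneck={p.bottleneck:<10s} "
          f"T_compute={p.latency_compute.to('ms'):~P.3f} "
          f"T_memory={p.latency_memory.to('ms'):~P.3f}")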
Exercise 2 — LLM Serving Capacity (Part 2: Serving and Inference)
Learning Objective: Determine how many concurrent LLM requests fit in GPU memory, accounting for both model weights and KV-cache.
Setup
from mlsysim import ServingModel, Hardware, Models
serving = ServingModel()
model = Models.Llama3_8B
hw = Hardware.H100
Task
Sweep batch_size from 1 to 128 at seq_len=4096. Find the maximum
batch size where the serving result is still feasible.
for bs in [1, 2, 4, 8, 16, 32, 64, 128]:
    r = serving.solve(model, hw, seq_len=4096, batch_size=bs, precision="fp16")
    print(f"bs={bs:>4d} feasible={r.feasible} "
          f"mem={r.total_memory_required:~P.1f} "
          f"kv_cache={r.kv_cache_size:~P.1f} "
          f"TTFT={r.ttft:~P.1f} ITL={r.itl:~P.2f}")
Question
How many concurrent Llama-3-8B requests can a single H100 (80 GB) serve at 4K context in FP16? What is the dominant memory consumer at that maximum batch size -- weights or KV-cache?
Hint
H100 has 80 GB HBM3. Llama-3-8B at FP16 is ~16 GB of weights. Each request's KV-cache at 4096 tokens depends on the model's layer count, head count, and head dimension.
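For a back-of-envelope KV-cache estimate, a small sketch using the publicly reported Llama-3-8B shapes (an assumption here, not values read from the mlsysim registry):
# Per-request KV-cache: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes/elem
n_layers, n_kv_heads, head_dim = 32, 8, 128   # Llama-3-8B uses grouped-query attention
seq_len, bytes_per_elem = 4096, 2             # 4K context at FP16
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV-cache per request: {kv_bytes / 1e9:.2f} GB")   # ~0.5 GB, matching the exercise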
Expected Answer
The maximum concurrent batch size is approximately 128 requests on the H100 (80 GiB = ~85.9 GB). At FP16, Llama-3-8B weights consume ~16 GB. Each request's KV-cache at 4096 tokens is ~0.5 GB, so after reserving ~16 GB for weights, the remaining ~70 GB fits roughly 128 KV-cache slots. At maximum capacity, KV-cache dominates memory usage (~69 GB KV vs. ~16 GB weights). At bs=160, total memory exceeds capacity and becomes infeasible.
Discussion
What happens if you switch from FP16 to INT8 quantization? The weights halve, but the KV-cache also shrinks. Which effect matters more for concurrent serving capacity?
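A sketch for exploring this, assuming serving.solve accepts precision="int8" the same way the roofline engine does:
# Assumption: the serving model supports precision="int8"
for prec in ["fp16", "int8"]:
    r = serving.solve(model, hw, seq_len=4096, batch_size=64, precision=prec)
    print(f"{prec}: feasible={r.feasible} "
          f"mem={r.total_memory_required:~P.1f} kv_cache={r.kv_cache_size:~P.1f}")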
Exercise 3 — Quantization Trade-offs (Part 3: Compression and Efficiency)
Learning Objective: Quantify the memory savings and inference speedup from INT4 quantization, and understand the accuracy trade-off.
Setup
from mlsysim import CompressionModel, Engine, Hardware, Models
compress = CompressionModel()
model = Models.Llama3_8B
hw = Hardware.H100
Task
Compare the FP16 baseline with INT4 quantization. Run the compression
model and then run Engine.solve() at both precisions.
# Compression analysis
result = compress.solve(model, hw, method="quantization", target_bitwidth=4)
print(f"Compression ratio: {result.compression_ratio:.1f}x")
print(f"Original size: {result.original_size_gb:~P.2f}")
print(f"Compressed size: {result.compressed_size_gb:~P.2f}")
print(f"Memory savings: {result.memory_savings_pct:.1f}%")
print(f"Inference speedup: {result.inference_speedup:.2f}x")
print(f"Accuracy delta: {result.estimated_accuracy_delta:.2f}%")
# Roofline comparison
fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.5)
int4 = Engine.solve(model, hw, batch_size=1, precision="int4", efficiency=0.5)
print(f"\nFP16 latency: {fp16.latency:~P.2f} bottleneck: {fp16.bottleneck}")
print(f"INT4 latency: {int4.latency:~P.2f} bottleneck: {int4.bottleneck}")
Question
What is the memory savings from FP32 baseline to INT4 for Llama-3-8B? What is the estimated accuracy degradation? Is the speedup closer to 8x or 4x, and why?
Hint
The CompressionModel measures compression ratio from the FP32 baseline
(32-bit), so INT4 gives an 8x ratio on paper (32/4). But inference speedup
depends on whether the workload is compute-bound or memory-bound. At
batch_size=1, LLM inference is memory-bound, so the speedup tracks with
the reduction in bytes moved, not FLOPS saved.
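A minimal sketch of that reasoning using nothing but byte counts (the ~8B parameter count is the publicly reported figure, not a registry value):
params = 8e9
bytes_fp32, bytes_fp16, bytes_int4 = params * 4, params * 2, params * 0.5
print(f"compression vs FP32: {bytes_fp32 / bytes_int4:.0f}x")           # 8x on paper
# In the memory-bound regime, speedup tracks bytes moved, so vs the FP16 baseline:
print(f"memory-bound speedup vs FP16: {bytes_fp16 / bytes_int4:.0f}x")  # ~4x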
Expected Answer
- Memory savings: ~87.5% (8x compression from FP32 to INT4 baseline), reducing the model from ~32 GB (FP32) to ~4 GB (INT4).
- Accuracy delta: Approximately 2-5% degradation (conservative estimate from the Gholami et al. survey).
- Inference speedup: At batch_size=1, the workload is memory-bound, so the speedup is roughly proportional to the bytes reduction. Compared to FP16 inference (the practical baseline), the speedup from INT4 is ~4x in memory traffic. The exact speedup depends on whether the hardware has native INT4 execution units (B200 does, H100 does not).
Discussion
When would you choose INT8 over INT4? Consider a deployment where you need <1% accuracy loss but also need to fit the model on a single GPU. What is the optimal compression point?
Exercise 4 — Parallelism Strategy Search (Part 4: Distributed Training)
Learning Objective: Find the optimal 3D parallelism configuration (TP x PP x DP) for training a 70B model on a 64-GPU cluster.
Setup
from mlsysim import ParallelismOptimizer, Models, Systems
optimizer = ParallelismOptimizer()
model = Models.Llama3_70B
fleet = Systems.Clusters.Research_256 # 256 H100s
Task
Since we want 64 GPUs, build a custom fleet. Then run the optimizer.
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics
fleet_64 = Fleet(
name="64-GPU H100 Cluster",
node=Nodes.DGX_H100,
count=8, # 8 nodes x 8 GPUs = 64
fabric=Fabrics.InfiniBand_NDR
)
result = optimizer.solve(
model, fleet_64,
batch_size=512,
precision="fp16",
efficiency=0.4,
overlap_comm=True
)
print(f"Best config: {result.best_config}")
print(f"Best MFU: {result.best_mfu:.3f}")
print(f"Best step time: {result.best_step_time:~P.1f}")
print(f"Configs explored: {result.total_searched}")
# Print top candidates
for c in result.top_candidates[:10]:
    cfg = c['config']
    print(f" TP={cfg['tp']:>2d} PP={cfg['pp']:>2d} DP={cfg['dp']:>2d} "
          f"MFU={c['mfu']:.3f}")
Question
What is the optimal TP x PP x DP split for Llama-3-70B on 64 H100s? Why does the optimizer prefer TP=8 (one full node) for tensor parallelism?
Hint
TP communication happens over NVLink (900 GB/s within a DGX H100 node). PP communication is point-to-point (small volume). DP communication is AllReduce over InfiniBand (across nodes). The optimizer balances memory fit (TP+PP must shard the 70B model enough to fit) against communication overhead.
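To build intuition for what the optimizer searches over, a sketch that enumerates the legal TP x PP x DP factorizations of 64 GPUs and flags which ones shard the ~140 GB of FP16 weights down to the 80 GB per-GPU budget. It counts weight memory only; optimizer states, gradients, and activations are ignored:
total_gpus, weight_gb, hbm_gb = 64, 140, 80   # rough FP16 weight footprint for 70B params
for tp in [1, 2, 4, 8]:                        # keep TP within one NVLink domain
    for pp in [1, 2, 4, 8]:
        if total_gpus % (tp * pp):
            continue
        dp = total_gpus // (tp * pp)
        per_gpu = weight_gb / (tp * pp)        # weights sharded by TP and PP, replicated across DP
        print(f"TP={tp} PP={pp} DP={dp:>2d} weight/GPU ~{per_gpu:5.1f} GB "
              f"{'fits' if per_gpu < hbm_gb else 'too big'}")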
Expected Answer
The optimizer typically finds TP=8, PP=2, DP=4 or a nearby configuration.
- TP=8 keeps tensor parallelism within a single DGX node (8 GPUs connected by NVLink at 900 GB/s), minimizing communication latency for the 2 AllReduce operations per layer.
- PP=2 splits the 80 layers across 2 pipeline stages, reducing per-GPU memory to fit in 80 GB HBM.
- DP=4 provides data parallelism across 4 groups, allowing a reasonable local batch size of 128 per DP rank.
The MFU is typically 0.30-0.45, reflecting the pipeline bubble and communication overheads.
Discussion
What happens if you double the cluster to 128 GPUs? Does MFU go up or down? Which parallelism dimension should absorb the extra GPUs?
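To explore this, rebuild the fleet with 16 nodes and rerun the same search; this simply repeats the pattern above with count=16:
fleet_128 = Fleet(
    name="128-GPU H100 Cluster",
    node=Nodes.DGX_H100,
    count=16,   # 16 nodes x 8 GPUs = 128
    fabric=Fabrics.InfiniBand_NDR
)
result_128 = optimizer.solve(model, fleet_128, batch_size=512, precision="fp16",
                             efficiency=0.4, overlap_comm=True)
print(f"128-GPU best config: {result_128.best_config}  MFU={result_128.best_mfu:.3f}")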
Exercise 5 — Carbon Geography (Part 5: Sustainability)
Learning Objective: Quantify how datacenter location affects the carbon footprint of a long training run.
Setup
from mlsysim import SustainabilityModel, Infra, Systems
sustain = SustainabilityModel()
fleet = Systems.Clusters.Research_256 # 256 H100s
Task
Compare a 30-day training run in Virginia (US Average grid) versus Quebec (hydroelectric).
# Virginia (US Average)
r_va = sustain.solve(fleet, duration_days=30, datacenter=Infra.US_Avg, mfu=0.4)
print(f"=== Virginia (US Average Grid) ===")
print(f"Energy: {r_va.total_energy_kwh:,.0f} kWh")
print(f"Carbon: {r_va.carbon_footprint_kg:,.0f} kg CO2")
print(f"Water: {r_va.water_usage_liters:,.0f} liters")
# Quebec (Hydro)
r_qc = sustain.solve(fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4)
print(f"\n=== Quebec (Hydroelectric) ===")
print(f"Energy: {r_qc.total_energy_kwh:,.0f} kWh")
print(f"Carbon: {r_qc.carbon_footprint_kg:,.0f} kg CO2")
print(f"Water: {r_qc.water_usage_liters:,.0f} liters")
print(f"\n=== Comparison ===")
print(f"Carbon reduction: {(1 - r_qc.carbon_footprint_kg / r_va.carbon_footprint_kg) * 100:.1f}%")
print(f"Water reduction: {(1 - r_qc.water_usage_liters / r_va.water_usage_liters) * 100:.1f}%")
Question
How much carbon (in kg CO2) does moving from Virginia to Quebec save for a 30-day training run on 256 H100s? What fraction of the total energy is consumed by cooling and power delivery (not compute)?
Hint
Quebec's carbon intensity is ~1.2 gCO2/kWh (hydro) vs. US average ~390 gCO2/kWh (mixed grid). The PUE overhead is the ratio of total facility energy to IT energy -- a PUE of 1.1 means 10% overhead.
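A back-of-envelope version of the same calculation. The ~1 kW per GPU (including host and networking share) is a rough assumption, not the registry's power model:
gpus, kw_per_gpu, hours = 256, 1.0, 30 * 24
it_kwh = gpus * kw_per_gpu * hours
for region, pue, g_per_kwh in [("US Avg", 1.1, 390), ("Quebec", 1.05, 1.2)]:
    total_kwh = it_kwh * pue              # facility energy = IT energy x PUE
    co2_kg = total_kwh * g_per_kwh / 1000 # grid carbon intensity in gCO2/kWh
    print(f"{region}: {total_kwh:,.0f} kWh  {co2_kg:,.0f} kg CO2")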
Expected Answer
- Energy: Both regions consume similar total energy (~180,000-220,000 kWh depending on PUE), since the GPUs draw the same power regardless of location.
- Carbon: Virginia produces roughly 80,000-90,000 kg CO2 while Quebec produces approximately 200-300 kg CO2 -- a ~99% reduction.
- PUE overhead: Quebec's liquid-cooled facility (PUE ~1.05) wastes ~5% on infrastructure vs. US average air-cooled (PUE ~1.1-1.2) wasting 10-20%.
The carbon savings come entirely from the grid's energy source, not from using less power. This is why datacenter location is the single highest-leverage sustainability decision.
Discussion
If you also factor in cost (electricity price per kWh), does Quebec remain
the optimal choice? Use EconomicsModel to find out. What about the
network latency penalty if your team is in California?
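A sketch of that follow-up, reusing the EconomicsModel call from Exercises 6 and 8. It builds region-tagged fleets by hand (32 DGX H100 nodes to match the 256-GPU cluster), since whether regional electricity pricing flows through the TCO depends on the registry:
from mlsysim import EconomicsModel
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics

econ = EconomicsModel()
for region_name, region in [("Virginia", Infra.US_Avg), ("Quebec", Infra.Quebec)]:
    f = Fleet(name=region_name, node=Nodes.DGX_H100, count=32,   # 32 nodes x 8 = 256 GPUs
              fabric=Fabrics.InfiniBand_NDR, region=region)
    tco = econ.solve(f, duration_days=30, mfu=0.4)
    print(f"{region_name}: 30-day cost ${tco.tco_usd:,.0f}")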
Exercise 6 — Pareto-Optimal Serving (Part 6: Economics and Fleet Design)
Learning Objective: Given a fixed budget, find the serving configuration that maximizes throughput while meeting latency SLAs.
Setup
from mlsysim import (
EconomicsModel, ServingModel, Hardware, Models,
Systems, Infra
)
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.systems.registry import Nodes, Fabrics
serving = ServingModel()
econ = EconomicsModel()
model = Models.Llama3_8B
Task
Compare three hardware options for serving Llama-3-8B at 4K context, each scaled to fit within a $1M annual budget.
configs = [
("A100 cluster", Hardware.A100, Nodes.DGX_A100, 20), # ~20 nodes
("H100 cluster", Hardware.H100, Nodes.DGX_H100, 8), # ~8 nodes
("B200 cluster", Hardware.B200, Nodes.DGX_B200, 4), # ~4 nodes
]
for name, hw, node_template, n_nodes in configs:
    fleet = Fleet(
        name=name, node=node_template, count=n_nodes,
        fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg
    )
    # Economics: 365-day TCO
    tco = econ.solve(fleet, duration_days=365, mfu=0.3)
    # Serving: per-GPU capacity
    r = serving.solve(model, hw, seq_len=4096, batch_size=32, precision="fp16")
    total_gpus = fleet.total_accelerators
    print(f"\n=== {name} ({total_gpus} GPUs) ===")
    print(f" TCO: ${tco.tco_usd:,.0f}")
    print(f" TTFT: {r.ttft:~P.1f}")
    print(f" ITL: {r.itl:~P.2f}")
    print(f" Feasible: {r.feasible}")
    print(f" Per-GPU mem: {r.total_memory_required:~P.1f}")
Question
With a $1M annual budget, which hardware generation provides the best cost-per-request for Llama-3-8B serving? Is it always the newest GPU?
Hint
Newer GPUs have higher unit cost but also higher bandwidth and FLOPS. The Pareto-optimal choice depends on whether serving is memory-bound (bandwidth matters) or compute-bound (FLOPS matter), and how many GPUs you can afford.
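One way to turn the printed numbers into a comparable metric is cost per million generated tokens. The helper below is a hypothetical sketch, not part of mlsysim, and it approximates per-GPU decode throughput as batch_size / ITL, ignoring prefill and scheduling overheads:
def cost_per_million_tokens(tco_usd, n_gpus, itl_ms, batch_size=32):
    tokens_per_sec_per_gpu = batch_size / (itl_ms / 1000)  # crude: all batch slots decode concurrently
    tokens_per_year = tokens_per_sec_per_gpu * n_gpus * 365 * 24 * 3600
    return tco_usd / (tokens_per_year / 1e6)

# Example with made-up numbers: $1M TCO, 64 GPUs, 20 ms ITL
print(f"${cost_per_million_tokens(1_000_000, 64, 20.0):.4f} per million tokens")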
Expected Answer
The answer depends on the specific unit costs in the registry, but the general finding is:
- A100: Cheapest per-unit, so you get the most GPUs, but each has lower bandwidth (2 TB/s vs. 3.35 TB/s for H100). Good for throughput-oriented workloads where you can batch aggressively.
- H100: Best balance of cost and performance for LLM serving. Higher bandwidth directly reduces ITL (inter-token latency) in the memory-bound decode phase.
- B200: Highest per-unit cost but offers FP8/INT4 support and highest bandwidth (~8 TB/s). Most cost-effective only if you can use lower precision.
The Pareto frontier typically shows H100 as the sweet spot for FP16 serving, with B200 winning if INT4/FP8 quantization is acceptable.
Discussion
How does the analysis change if you add a latency SLA (e.g., ITL < 20ms)? Does the Pareto-optimal choice shift when you constrain latency instead of just minimizing cost?
Exercise 7 — TinyML SLA Feasibility (Part 7: Edge and TinyML)
Learning Objective: Determine whether a keyword-spotting CNN can meet a real-time SLA on a microcontroller.
Setup
from mlsysim import Engine, Models, ureg
from mlsysim.hardware.types import HardwareNode, ComputeCore, MemoryHierarchy
model = Models.Tiny.DS_CNN
# Construct the nRF52840 (Cortex-M4F @ 64 MHz) -- MLPerf Tiny reference platform
hw = HardwareNode(
name="Nordic nRF52840 (Cortex-M4F)",
release_year=2018,
compute=ComputeCore(
peak_flops=0.000064 * ureg.TFLOPs / ureg.s,
precision_flops={"int8": 0.000128 * ureg.TFLOPs / ureg.s},
),
memory=MemoryHierarchy(
capacity=1 * ureg.MB,
bandwidth=0.064 * ureg.GB / ureg.s,
sram_capacity=256 * ureg.KiB,
sram_bandwidth=0.256 * ureg.GB / ureg.s,
flash_capacity=1 * ureg.MB,
flash_bandwidth=0.064 * ureg.GB / ureg.s,
),
tdp=0.015 * ureg.W,
dispatch_tax=0.5 * ureg.ms,
)
Task
Check if DS-CNN keyword spotting meets a 30ms latency SLA on the nRF52840. Then explore what happens at different precisions and efficiency levels.
# Baseline: INT8 (native on Cortex-M4F)
p_int8 = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=0.3)
print(f"=== DS-CNN on nRF52840 (INT8) ===")
print(f"Latency: {p_int8.latency:~P.2f}")
print(f"Bottleneck: {p_int8.bottleneck}")
print(f"Memory: {p_int8.memory_footprint:~P.2f}")
print(f"Feasible: {p_int8.feasible}")
print(f"Energy: {p_int8.energy:~P.4f}")
print(f"Meets 30ms: {p_int8.latency.to('ms').magnitude < 30}")
# Compare: FP16 (no hardware support -- uses FP32 path)
p_fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.1)
print(f"\n=== DS-CNN on nRF52840 (FP16 emulated) ===")
print(f"Latency: {p_fp16.latency:~P.2f}")
print(f"Meets 30ms: {p_fp16.latency.to('ms').magnitude < 30}")
# Sweep efficiency to find the minimum required
for eff in [0.1, 0.2, 0.3, 0.4, 0.5]:
    p = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=eff)
    print(f"eff={eff:.1f} latency={p.latency:~P.2f} meets_30ms={p.latency.to('ms').magnitude < 30}")
Question
Can DS-CNN meet a 30ms SLA on the nRF52840? If not, what is the binding constraint, and what hardware or algorithmic change would close the gap?
Hint
The nRF52840 has ~128 MOPS at INT8 and ~64 MFLOPS at FP32. DS-CNN has ~6M FLOPs. Do the back-of-envelope math: 6M / (128M * efficiency). Is this compute-bound or memory-bound?
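The same back-of-envelope as a two-line sketch (raw compute floor only; dispatch and per-layer overheads come on top):
flops, int8_ops_per_sec = 6e6, 128e6
for eff in [0.3, 1.0]:
    print(f"eff={eff:.1f}: compute floor ~{flops / (int8_ops_per_sec * eff) * 1000:.0f} ms")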
Expected Answer
- INT8 at efficiency=0.3: DS-CNN inference takes approximately 500ms on the nRF52840 -- far exceeding the 30ms SLA. The workload is compute-bound: 6M FLOPs / (128 MOPS * 0.3) ~ 156ms of raw compute, plus framework overhead (dispatch tax, layer tax) pushes it past 500ms.
- FP16 emulated: Even slower (~1500ms+) because the Cortex-M4F has no native FP16 support and falls back to the FP32 path at 64 MFLOPS.
- Energy: At 15 mW TDP over ~500ms, a single inference consumes roughly 8 millijoules -- acceptable for battery operation, but the latency is the problem, not energy.
- The gap: To meet 30ms, you would need either a faster MCU (e.g., Cortex-M7 at 480 MHz with DSP extensions) or a smaller model (fewer FLOPs). Even at efficiency=1.0, the raw compute time is 6M/128M = 47ms -- still above 30ms. The nRF52840 simply cannot meet this SLA for DS-CNN.
Discussion
This result surprises many students: a "tiny" 26K-parameter model still
cannot meet a 30ms SLA on a microcontroller. What does this teach about
the relationship between model size and latency? Try the ESP32-S3
(Hardware.Tiny.ESP32_S3) which has ~20x the compute throughput.
Does it meet 30ms? What if you needed to run MobileNetV2 -- try
Engine.solve(Models.Vision.MobileNetV2, hw, ...) and observe
the feasible field to see where the model hits the memory wall.
Exercise 8 — Capstone: Fleet Design Under Constraints (Parts 1-7)
Learning Objective: Design a complete serving fleet for Llama-3-70B at 1000 QPS, within a $5M annual budget, deployed across two regions.
Setup
from mlsysim import (
ServingModel, EconomicsModel, SustainabilityModel, CompressionModel,
Hardware, Models, Infra
)
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics
model = Models.Llama3_70B
serving = ServingModel()
econ = EconomicsModel()
sustain = SustainabilityModel()
compress = CompressionModel()
Task
Design a fleet that meets ALL constraints simultaneously:
- Throughput: 1000 QPS (queries per second) total across two regions
- Latency: ITL < 50ms per token
- Budget: < $5M annual TCO
- Carbon: < 500 tonnes CO2/year
- Regions: US East + Quebec (for redundancy)
Step 1: Determine per-GPU serving capacity.
# How many QPS per H100 at FP16?
r = serving.solve(model, Hardware.H100, seq_len=2048, batch_size=1, precision="fp16")
print(f"Single H100: TTFT={r.ttft:~P.1f} ITL={r.itl:~P.1f} feasible={r.feasible}")
# Try with INT4 compression
c = compress.solve(model, Hardware.H100, method="quantization", target_bitwidth=4)
print(f"INT4 compression: {c.compression_ratio:.1f}x accuracy_delta={c.accuracy_delta:.1f}%")
Step 2: Calculate fleet size needed for 1000 QPS.
# Estimate: each H100 can decode ~X tokens/sec for this model
# tokens_per_sec_per_gpu = 1 / itl_in_seconds (single-stream decode; batching raises this substantially)
itl_sec = r.itl.to("s").magnitude
tokens_per_sec = 1.0 / itl_sec if itl_sec > 0 else 0
print(f"Tokens/sec per GPU: {tokens_per_sec:.1f}")
print(f"GPUs needed for 1000 QPS: ~{1000 / max(tokens_per_sec, 1):.0f}")
Step 3: Check budget and carbon for each region split.
for us_pct in [0.3, 0.5, 0.7]:
    qc_pct = 1.0 - us_pct
    # (Adapt node count based on your QPS calculation above)
    n_total = 100  # placeholder -- replace with your calculation
    n_us = int(n_total * us_pct)
    n_qc = n_total - n_us
    fleet_us = Fleet(name="US East", node=Nodes.DGX_H100, count=max(1, n_us),
                     fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg)
    fleet_qc = Fleet(name="Quebec", node=Nodes.DGX_H100, count=max(1, n_qc),
                     fabric=Fabrics.InfiniBand_NDR, region=Infra.Quebec)
    tco_us = econ.solve(fleet_us, duration_days=365, mfu=0.3)
    tco_qc = econ.solve(fleet_qc, duration_days=365, mfu=0.3)
    co2_us = sustain.solve(fleet_us, duration_days=365, mfu=0.3)
    co2_qc = sustain.solve(fleet_qc, duration_days=365, mfu=0.3)
    total_tco = tco_us.tco_usd + tco_qc.tco_usd
    total_co2 = (co2_us.carbon_footprint_kg + co2_qc.carbon_footprint_kg) / 1000  # tonnes
    print(f"Split {us_pct:.0%} US / {qc_pct:.0%} QC: "
          f"TCO=${total_tco:,.0f} CO2={total_co2:.0f}t")
Question
What is your recommended fleet configuration? How many total GPUs, what regional split, and does INT4 quantization change your answer?
Hint
Llama-3-70B at FP16 requires ~140 GB -- it does not fit on a single H100 (80 GB). You must either use tensor parallelism (2+ GPUs per inference instance) or quantize to INT4 (~35 GB, fits on one H100). This fundamentally changes the fleet size calculation.
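A quick check of that constraint (70e9 parameters is the publicly reported model size; 80 GB is the H100's HBM capacity, and only weights are counted -- KV-cache and activations come on top):
params, hbm_gb = 70e9, 80
for label, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9   # weights only
    print(f"{label}: ~{gb:.0f} GB weights -> "
          f"{'fits on one H100' if gb < hbm_gb else 'needs TP or more GPUs'}")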
Expected Answer
This is an open-ended design exercise. A strong answer includes:
- Precision choice: INT4 quantization (35 GB) fits on a single H100, halving the GPU requirement vs. FP16 (which needs TP=2 minimum). The 2-5% accuracy trade-off is usually acceptable for serving.
- Fleet size: With INT4 and typical ITL of ~30-50ms per token, each H100 can handle roughly 20-30 QPS. For 1000 QPS total, you need approximately 35-50 GPUs (5-7 DGX H100 nodes).
- Regional split: Placing 70% of capacity in Quebec dramatically reduces carbon (hydro grid at ~1.2 gCO2/kWh vs. US average at ~390 gCO2/kWh) while keeping 30% in US East for latency-sensitive users. Total CO2 stays well under 500 tonnes/year.
- Budget check: 5-7 DGX H100 nodes at ~$200-300K each, plus electricity and networking, fits within the $5M annual budget.
The key insight is that compression is not just an optimization -- it is an architectural decision that changes the fleet design by 2x.
Discussion
What are the failure modes of this design? What happens when a full DGX node goes down in the Quebec region? How does the ReliabilityModel inform your redundancy strategy? Should you over-provision by N+1 or N+2?