MLSys·im Tutorial: Hands-On Exercises
Eight exercises aligned with the eight tutorial parts. Each exercise is self-contained and takes 5-10 minutes.
Exercise 1 — The Roofline Transition (Part 1: Single-Node Performance)
Learning Objective: Identify the batch size where a CNN workload transitions from memory-bound to compute-bound on a datacenter GPU.
Setup
import mlsysim
from mlsysim import Engine, Hardware, Models
model = Models.ResNet50
hw = Hardware.A100
Task
Sweep batch_size from 1 to 512 (powers of 2). For each, call
Engine.solve() and record the bottleneck field along with the
compute and memory latency components. Find the smallest batch size
where the compute term overtakes the memory term.
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    p = Engine.solve(model, hw, batch_size=bs, precision="fp16", efficiency=0.5)
    print(f"bs={bs:>4d} bottleneck={p.bottleneck:<10s} "
          f"T_compute={p.latency_compute.to('ms'):~P.3f} "
          f"T_memory={p.latency_memory.to('ms'):~P.3f} "
          f"throughput={p.throughput:~P.1f}")
Question
At what batch size does ResNet-50's compute time overtake its memory time
on the A100? Compare latency_compute vs. latency_memory to find the
exact crossover point.
Hint
Watch for when latency_compute > latency_memory. The crossover is the
Roofline ridge point. Note that the reported bottleneck field depends
on the overall arithmetic intensity calculation, so inspect both latency
components directly.
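As a back-of-envelope check, independent of the simulator, you can estimate the ridge point from published A100 numbers. The figures below (roughly 312 TFLOPS dense FP16 and ~2 TB/s HBM bandwidth) are datasheet approximations, not values read from the mlsysim registry:
# Rough ridge-point sketch -- datasheet numbers, not mlsysim registry values
peak_flops = 312e12 * 0.5    # A100 FP16 tensor peak, scaled by efficiency=0.5
mem_bw = 2.0e12              # bytes/s, approximate A100 80GB HBM bandwidth
ridge = peak_flops / mem_bw  # FLOPs/byte where compute time equals memory time
flops_per_image = 8e9        # ~8 GFLOPs per ResNet-50 forward pass
weight_bytes = 51e6          # ~51 MB of FP16 weights
ai_bs1 = flops_per_image / weight_bytes  # arithmetic intensity at batch_size=1, ignoring activations
print(f"ridge ~{ridge:.0f} FLOPs/byte, ResNet-50 AI at bs=1 ~{ai_bs1:.0f} FLOPs/byte")
If the arithmetic intensity already exceeds the ridge at batch_size=1, the workload is compute-bound from the start, which is what the sweep above should confirm.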
Expected Answer
ResNet-50 on the A100 at FP16 with efficiency=0.5 is already compute-bound
at very small batch sizes because of its high FLOP count (~8 GFLOPs per
image) relative to its small weight footprint (~51 MB at FP16). The compute
term scales linearly with batch size while the memory term grows more slowly
(weights loaded once, only activation traffic scales). For CNN workloads
like ResNet-50, the transition may already be at batch_size=1, unlike
Transformer decode which is memory-bound at batch_size=1. Compare with
Models.Llama3_8B to see a memory-bound regime.
Discussion
Why does this transition point matter for production deployment? Compare ResNet-50 (compute-bound even at batch_size=1) with Llama-3-8B (memory-bound at batch_size=1). What hardware characteristic should you optimize for in each case -- peak FLOPS or memory bandwidth?
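For the comparison, a quick sketch that reuses the same Engine.solve call as above (exact latencies and bottleneck labels depend on the registry entries):
for name, m in [("ResNet-50", Models.ResNet50), ("Llama-3-8B", Models.Llama3_8B)]:
    p = Engine.solve(m, hw, batch_size=1, precision="fp16", efficiency=0.5)
    print(f"{name:<11s} bottleneck={p.bottleneck:<10s} "
          f"T_compute={p.latency_compute.to('ms'):~P.3f} "
          f"T_memory={p.latency_memory.to('ms'):~P.3f}")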
Exercise 2 — LLM Serving Capacity (Part 2: Serving and Inference)
Learning Objective: Determine how many concurrent LLM requests fit in GPU memory, accounting for both model weights and KV-cache.
Setup
from mlsysim import ServingModel, Hardware, Models
serving = ServingModel()
model = Models.Llama3_8B
hw = Hardware.H100
Task
Sweep batch_size from 1 to 128 at seq_len=4096. Find the maximum
batch size where the serving result is still feasible.
for bs in [1, 2, 4, 8, 16, 32, 64, 128]:
    r = serving.solve(model, hw, seq_len=4096, batch_size=bs, precision="fp16")
    print(f"bs={bs:>4d} feasible={r.feasible} "
          f"mem={r.total_memory_required:~P.1f} "
          f"kv_cache={r.kv_cache_size:~P.1f} "
          f"TTFT={r.ttft:~P.1f} ITL={r.itl:~P.2f}")
Question
How many concurrent Llama-3-8B requests can a single H100 (80 GB) serve at 4K context in FP16? What is the dominant memory consumer at that maximum batch size -- weights or KV-cache?
Hint
H100 has 80 GB HBM3. Llama-3-8B at FP16 is ~16 GB of weights. Each request's KV-cache at 4096 tokens depends on the model's layer count, head count, and head dimension.
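For a back-of-envelope KV-cache estimate, a small sketch using the publicly reported Llama-3-8B shapes (an assumption here, not values read from the mlsysim registry):
# Per-request KV-cache: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes/elem
n_layers, n_kv_heads, head_dim = 32, 8, 128   # Llama-3-8B uses grouped-query attention
seq_len, bytes_per_elem = 4096, 2             # 4K context at FP16
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV-cache per request: {kv_bytes / 1e9:.2f} GB")   # ~0.5 GB, matching the exercise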
Expected Answer
The maximum concurrent batch size is approximately 128 requests on the H100 (80 GiB = ~85.9 GB). At FP16, Llama-3-8B weights consume ~16 GB. Each request's KV-cache at 4096 tokens is ~0.5 GB, so after reserving ~16 GB for weights, the remaining ~70 GB fits roughly 128 KV-cache slots. At maximum capacity, KV-cache dominates memory usage (~69 GB KV vs. ~16 GB weights). At bs=160, total memory exceeds capacity and becomes infeasible.
Discussion
What happens if you switch from FP16 to INT8 quantization? The weights halve, but the KV-cache also shrinks. Which effect matters more for concurrent serving capacity?
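A sketch for exploring this, assuming serving.solve accepts precision="int8" the same way the roofline engine does:
# Assumption: the serving model supports precision="int8"
for prec in ["fp16", "int8"]:
    r = serving.solve(model, hw, seq_len=4096, batch_size=64, precision=prec)
    print(f"{prec}: feasible={r.feasible} "
          f"mem={r.total_memory_required:~P.1f} kv_cache={r.kv_cache_size:~P.1f}")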
Exercise 3 — Quantization Trade-offs (Part 3: Compression and Efficiency)
Learning Objective: Quantify the memory savings and inference speedup from INT4 quantization, and understand the accuracy trade-off.
Setup
from mlsysim import CompressionModel, Engine, Hardware, Models
compress = CompressionModel()
model = Models.Llama3_8B
hw = Hardware.H100
Task
Compare the FP16 baseline with INT4 quantization. Run the compression
model and then run Engine.solve() at both precisions.
# Compression analysis
result = compress.solve(model, hw, method="quantization", target_bitwidth=4)
print(f"Compression ratio: {result.compression_ratio:.1f}x")
print(f"Original size: {result.original_size_gb:~P.2f}")
print(f"Compressed size: {result.compressed_size_gb:~P.2f}")
print(f"Memory savings: {result.memory_savings_pct:.1f}%")
print(f"Inference speedup: {result.inference_speedup:.2f}x")
print(f"Accuracy delta: {result.estimated_accuracy_delta:.2f}%")
# Roofline comparison
fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.5)
int4 = Engine.solve(model, hw, batch_size=1, precision="int4", efficiency=0.5)
print(f"\nFP16 latency: {fp16.latency:~P.2f} bottleneck: {fp16.bottleneck}")
print(f"INT4 latency: {int4.latency:~P.2f} bottleneck: {int4.bottleneck}")
Question
What is the memory savings from FP32 baseline to INT4 for Llama-3-8B? What is the estimated accuracy degradation? Is the speedup closer to 8x or 4x, and why?
Hint
The CompressionModel measures compression ratio from the FP32 baseline
(32-bit), so INT4 gives an 8x ratio on paper (32/4). But inference speedup
depends on whether the workload is compute-bound or memory-bound. At
batch_size=1, LLM inference is memory-bound, so the speedup tracks with
the reduction in bytes moved, not FLOPS saved.
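A minimal sketch of that reasoning using nothing but byte counts (the ~8B parameter count is the publicly reported figure, not a registry value):
params = 8e9
bytes_fp32, bytes_fp16, bytes_int4 = params * 4, params * 2, params * 0.5
print(f"compression vs FP32: {bytes_fp32 / bytes_int4:.0f}x")           # 8x on paper
# In the memory-bound regime, speedup tracks bytes moved, so vs the FP16 baseline:
print(f"memory-bound speedup vs FP16: {bytes_fp16 / bytes_int4:.0f}x")  # ~4x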
Expected Answer
- Memory savings: ~87.5% (8x compression from FP32 to INT4 baseline), reducing the model from ~32 GB (FP32) to ~4 GB (INT4).
- Accuracy delta: Approximately 2-5% degradation (conservative estimate from the Gholami et al. survey).
- Inference speedup: At batch_size=1, the workload is memory-bound, so the speedup is roughly proportional to the bytes reduction. Compared to FP16 inference (the practical baseline), the speedup from INT4 is ~4x in memory traffic. The exact speedup depends on whether the hardware has native INT4 execution units (B200 does, H100 does not).
Discussion
When would you choose INT8 over INT4? Consider a deployment where you need <1% accuracy loss but also need to fit the model on a single GPU. What is the optimal compression point?
Exercise 4 — Parallelism Strategy Search (Part 4: Distributed Training)
Learning Objective: Find the optimal 3D parallelism configuration (TP x PP x DP) for training a 70B model on a 64-GPU cluster.
Setup
from mlsysim import ParallelismOptimizer, Models, Systems
optimizer = ParallelismOptimizer()
model = Models.Llama3_70B
fleet = Systems.Clusters.Research_256 # 256 H100s
Task
Since we want 64 GPUs, build a custom fleet. Then run the optimizer.
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics
fleet_64 = Fleet(
name="64-GPU H100 Cluster",
node=Nodes.DGX_H100,
count=8, # 8 nodes x 8 GPUs = 64
fabric=Fabrics.InfiniBand_NDR
)
result = optimizer.solve(
model, fleet_64,
batch_size=512,
precision="fp16",
efficiency=0.4,
overlap_comm=True
)
print(f"Best config: {result.best_config}")
print(f"Best MFU: {result.best_mfu:.3f}")
print(f"Best step time: {result.best_step_time:~P.1f}")
print(f"Configs explored: {result.total_searched}")
# Print top candidates
for c in result.top_candidates[:10]:
    cfg = c['config']
    print(f" TP={cfg['tp']:>2d} PP={cfg['pp']:>2d} DP={cfg['dp']:>2d} "
          f"MFU={c['mfu']:.3f}")
Question
What is the optimal TP x PP x DP split for Llama-3-70B on 64 H100s? Why does the optimizer prefer TP=8 (one full node) for tensor parallelism?
Hint
TP communication happens over NVLink (900 GB/s within a DGX H100 node). PP communication is point-to-point (small volume). DP communication is AllReduce over InfiniBand (across nodes). The optimizer balances memory fit (TP+PP must shard the 70B model enough to fit) against communication overhead.
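To build intuition for what the optimizer searches over, a sketch that enumerates the legal TP x PP x DP factorizations of 64 GPUs and flags which ones shard the ~140 GB of FP16 weights down to the 80 GB per-GPU budget. It counts weight memory only; optimizer states, gradients, and activations are ignored:
total_gpus, weight_gb, hbm_gb = 64, 140, 80   # rough FP16 weight footprint for 70B params
for tp in [1, 2, 4, 8]:                        # keep TP within one NVLink domain
    for pp in [1, 2, 4, 8]:
        if total_gpus % (tp * pp):
            continue
        dp = total_gpus // (tp * pp)
        per_gpu = weight_gb / (tp * pp)        # weights sharded by TP and PP, replicated across DP
        print(f"TP={tp} PP={pp} DP={dp:>2d} weight/GPU ~{per_gpu:5.1f} GB "
              f"{'fits' if per_gpu < hbm_gb else 'too big'}")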
Expected Answer
The optimizer typically finds TP=8, PP=2, DP=4 or a nearby configuration.
- TP=8 keeps tensor parallelism within a single DGX node (8 GPUs connected by NVLink at 900 GB/s), minimizing communication latency for the 2 AllReduce operations per layer.
- PP=2 splits the 80 layers across 2 pipeline stages, reducing per-GPU memory to fit in 80 GB HBM.
- DP=4 provides data parallelism across 4 groups, allowing a reasonable local batch size of 128 per DP rank.
The MFU is typically 0.30-0.45, reflecting the pipeline bubble and communication overheads.
Discussion
What happens if you double the cluster to 128 GPUs? Does MFU go up or down? Which parallelism dimension should absorb the extra GPUs?
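To explore this, rebuild the fleet with 16 nodes and rerun the same search; this simply repeats the pattern above with count=16:
fleet_128 = Fleet(
    name="128-GPU H100 Cluster",
    node=Nodes.DGX_H100,
    count=16,   # 16 nodes x 8 GPUs = 128
    fabric=Fabrics.InfiniBand_NDR
)
result_128 = optimizer.solve(model, fleet_128, batch_size=512, precision="fp16",
                             efficiency=0.4, overlap_comm=True)
print(f"128-GPU best config: {result_128.best_config}  MFU={result_128.best_mfu:.3f}")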
Exercise 5 — Carbon Geography (Part 5: Sustainability)
Learning Objective: Quantify how datacenter location affects the carbon footprint of a long training run.
Setup
from mlsysim import SustainabilityModel, Infra, Systems
sustain = SustainabilityModel()
fleet = Systems.Clusters.Research_256 # 256 H100s
Task
Compare a 30-day training run in Virginia (US Average grid) versus Quebec (hydroelectric).
# Virginia (US Average)
r_va = sustain.solve(fleet, duration_days=30, datacenter=Infra.US_Avg, mfu=0.4)
print(f"=== Virginia (US Average Grid) ===")
print(f"Energy: {r_va.total_energy_kwh:,.0f} kWh")
print(f"Carbon: {r_va.carbon_footprint_kg:,.0f} kg CO2")
print(f"Water: {r_va.water_usage_liters:,.0f} liters")
# Quebec (Hydro)
r_qc = sustain.solve(fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4)
print(f"\n=== Quebec (Hydroelectric) ===")
print(f"Energy: {r_qc.total_energy_kwh:,.0f} kWh")
print(f"Carbon: {r_qc.carbon_footprint_kg:,.0f} kg CO2")
print(f"Water: {r_qc.water_usage_liters:,.0f} liters")
print(f"\n=== Comparison ===")
print(f"Carbon reduction: {(1 - r_qc.carbon_footprint_kg / r_va.carbon_footprint_kg) * 100:.1f}%")
print(f"Water reduction: {(1 - r_qc.water_usage_liters / r_va.water_usage_liters) * 100:.1f}%")
Question
How much carbon (in kg CO2) does moving from Virginia to Quebec save for a 30-day training run on 256 H100s? What fraction of the total energy is consumed by cooling and power delivery (not compute)?
Hint
Quebec's carbon intensity is ~1.2 gCO2/kWh (hydro) vs. US average ~390 gCO2/kWh (mixed grid). The PUE overhead is the ratio of total facility energy to IT energy -- a PUE of 1.1 means 10% overhead.
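A back-of-envelope version of the same calculation. The ~1 kW per GPU (including host and networking share) is a rough assumption, not the registry's power model:
gpus, kw_per_gpu, hours = 256, 1.0, 30 * 24
it_kwh = gpus * kw_per_gpu * hours
for region, pue, g_per_kwh in [("US Avg", 1.1, 390), ("Quebec", 1.05, 1.2)]:
    total_kwh = it_kwh * pue              # facility energy = IT energy x PUE
    co2_kg = total_kwh * g_per_kwh / 1000 # grid carbon intensity in gCO2/kWh
    print(f"{region}: {total_kwh:,.0f} kWh  {co2_kg:,.0f} kg CO2")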
Expected Answer
- Energy: Both regions consume similar total energy (~180,000-220,000 kWh depending on PUE), since the GPUs draw the same power regardless of location.
- Carbon: Virginia produces roughly 80,000-90,000 kg CO2 while Quebec produces approximately 200-300 kg CO2 -- a ~99% reduction.
- PUE overhead: Quebec's liquid-cooled facility (PUE ~1.05) wastes ~5% on infrastructure vs. US average air-cooled (PUE ~1.1-1.2) wasting 10-20%.
The carbon savings come entirely from the grid's energy source, not from using less power. This is why datacenter location is the single highest-leverage sustainability decision.
Discussion
If you also factor in cost (electricity price per kWh), does Quebec remain
the optimal choice? Use EconomicsModel to find out. What about the
network latency penalty if your team is in California?
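A sketch of that follow-up, reusing the EconomicsModel call from Exercises 6 and 8. It builds region-tagged fleets by hand (32 DGX H100 nodes to match the 256-GPU cluster), since whether regional electricity pricing flows through the TCO depends on the registry:
from mlsysim import EconomicsModel
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics

econ = EconomicsModel()
for region_name, region in [("Virginia", Infra.US_Avg), ("Quebec", Infra.Quebec)]:
    f = Fleet(name=region_name, node=Nodes.DGX_H100, count=32,   # 32 nodes x 8 = 256 GPUs
              fabric=Fabrics.InfiniBand_NDR, region=region)
    tco = econ.solve(f, duration_days=30, mfu=0.4)
    print(f"{region_name}: 30-day cost ${tco.tco_usd:,.0f}")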
Exercise 6 — Pareto-Optimal Serving (Part 6: Economics and Fleet Design)
Learning Objective: Given a fixed budget, find the serving configuration that maximizes throughput while meeting latency SLAs.
Setup
from mlsysim import (
EconomicsModel, ServingModel, Hardware, Models,
Systems, Infra
)
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.systems.registry import Nodes, Fabrics
serving = ServingModel()
econ = EconomicsModel()
model = Models.Llama3_8B
Task
Compare three hardware options for serving Llama-3-8B at 4K context, each scaled to fit within a $1M annual budget.
configs = [
("A100 cluster", Hardware.A100, Nodes.DGX_A100, 20), # ~20 nodes
("H100 cluster", Hardware.H100, Nodes.DGX_H100, 8), # ~8 nodes
("B200 cluster", Hardware.B200, Nodes.DGX_B200, 4), # ~4 nodes
]
for name, hw, node_template, n_nodes in configs:
    fleet = Fleet(
        name=name, node=node_template, count=n_nodes,
        fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg
    )
    # Economics: 365-day TCO
    tco = econ.solve(fleet, duration_days=365, mfu=0.3)
    # Serving: per-GPU capacity
    r = serving.solve(model, hw, seq_len=4096, batch_size=32, precision="fp16")
    total_gpus = fleet.total_accelerators
    print(f"\n=== {name} ({total_gpus} GPUs) ===")
    print(f" TCO: ${tco.tco_usd:,.0f}")
    print(f" TTFT: {r.ttft:~P.1f}")
    print(f" ITL: {r.itl:~P.2f}")
    print(f" Feasible: {r.feasible}")
    print(f" Per-GPU mem: {r.total_memory_required:~P.1f}")
Question
With a $1M annual budget, which hardware generation provides the best cost-per-request for Llama-3-8B serving? Is it always the newest GPU?
Hint
Newer GPUs have higher unit cost but also higher bandwidth and FLOPS. The Pareto-optimal choice depends on whether serving is memory-bound (bandwidth matters) or compute-bound (FLOPS matter), and how many GPUs you can afford.
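One way to turn the printed numbers into a comparable metric is cost per million generated tokens. The helper below is a hypothetical sketch, not part of mlsysim, and it approximates per-GPU decode throughput as batch_size / ITL, ignoring prefill and scheduling overheads:
def cost_per_million_tokens(tco_usd, n_gpus, itl_ms, batch_size=32):
    tokens_per_sec_per_gpu = batch_size / (itl_ms / 1000)  # crude: all batch slots decode concurrently
    tokens_per_year = tokens_per_sec_per_gpu * n_gpus * 365 * 24 * 3600
    return tco_usd / (tokens_per_year / 1e6)

# Example with made-up numbers: $1M TCO, 64 GPUs, 20 ms ITL
print(f"${cost_per_million_tokens(1_000_000, 64, 20.0):.4f} per million tokens")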
Expected Answer
The answer depends on the specific unit costs in the registry, but the general finding is:
- A100: Cheapest per-unit, so you get the most GPUs, but each has lower bandwidth (2 TB/s vs. 3.35 TB/s for H100). Good for throughput-oriented workloads where you can batch aggressively.
- H100: Best balance of cost and performance for LLM serving. Higher bandwidth directly reduces ITL (inter-token latency) in the memory-bound decode phase.
- B200: Highest per-unit cost but offers FP8/INT4 support and highest bandwidth (~8 TB/s). Most cost-effective only if you can use lower precision.
The Pareto frontier typically shows H100 as the sweet spot for FP16 serving, with B200 winning if INT4/FP8 quantization is acceptable.
Discussion
How does the analysis change if you add a latency SLA (e.g., ITL < 20ms)? Does the Pareto-optimal choice shift when you constrain latency instead of just minimizing cost?
Exercise 7 — TinyML SLA Feasibility (Part 7: Edge and TinyML)
Learning Objective: Determine whether a keyword-spotting CNN can meet a real-time SLA on a microcontroller.
Setup
from mlsysim import Engine, Models, ureg
from mlsysim.hardware.types import HardwareNode, ComputeCore, MemoryHierarchy
model = Models.Tiny.DS_CNN
# Construct the nRF52840 (Cortex-M4F @ 64 MHz) -- MLPerf Tiny reference platform
hw = HardwareNode(
name="Nordic nRF52840 (Cortex-M4F)",
release_year=2018,
compute=ComputeCore(
peak_flops=0.000064 * ureg.TFLOPs / ureg.s,
precision_flops={"int8": 0.000128 * ureg.TFLOPs / ureg.s},
),
memory=MemoryHierarchy(
capacity=1 * ureg.MB,
bandwidth=0.064 * ureg.GB / ureg.s,
sram_capacity=256 * ureg.KiB,
sram_bandwidth=0.256 * ureg.GB / ureg.s,
flash_capacity=1 * ureg.MB,
flash_bandwidth=0.064 * ureg.GB / ureg.s,
),
tdp=0.015 * ureg.W,
dispatch_tax=0.5 * ureg.ms,
)
Task
Check if DS-CNN keyword spotting meets a 30ms latency SLA on the nRF52840. Then explore what happens at different precisions and efficiency levels.
# Baseline: INT8 (native on Cortex-M4F)
p_int8 = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=0.3)
print(f"=== DS-CNN on nRF52840 (INT8) ===")
print(f"Latency: {p_int8.latency:~P.2f}")
print(f"Bottleneck: {p_int8.bottleneck}")
print(f"Memory: {p_int8.memory_footprint:~P.2f}")
print(f"Feasible: {p_int8.feasible}")
print(f"Energy: {p_int8.energy:~P.4f}")
print(f"Meets 30ms: {p_int8.latency.to('ms').magnitude < 30}")
# Compare: FP16 (no hardware support -- uses FP32 path)
p_fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.1)
print(f"\n=== DS-CNN on nRF52840 (FP16 emulated) ===")
print(f"Latency: {p_fp16.latency:~P.2f}")
print(f"Meets 30ms: {p_fp16.latency.to('ms').magnitude < 30}")
# Sweep efficiency to find the minimum required
for eff in [0.1, 0.2, 0.3, 0.4, 0.5]:
    p = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=eff)
    print(f"eff={eff:.1f} latency={p.latency:~P.2f} meets_30ms={p.latency.to('ms').magnitude < 30}")
Question
Can DS-CNN meet a 30ms SLA on the nRF52840? If not, what is the binding constraint, and what hardware or algorithmic change would close the gap?
Hint
The nRF52840 has ~128 MOPS at INT8 and ~64 MFLOPS at FP32. DS-CNN has ~6M FLOPs. Do the back-of-envelope math: 6M / (128M * efficiency). Is this compute-bound or memory-bound?
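The same back-of-envelope as a two-line sketch (raw compute floor only; dispatch and per-layer overheads come on top):
flops, int8_ops_per_sec = 6e6, 128e6
for eff in [0.3, 1.0]:
    print(f"eff={eff:.1f}: compute floor ~{flops / (int8_ops_per_sec * eff) * 1000:.0f} ms")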
Expected Answer
- INT8 at efficiency=0.3: DS-CNN inference takes approximately 500ms on the nRF52840 -- far exceeding the 30ms SLA. The workload is compute-bound: 6M FLOPs / (128 MOPS * 0.3) ~ 156ms of raw compute, plus framework overhead (dispatch tax, layer tax) pushes it past 500ms.
- FP16 emulated: Even slower (~1500ms+) because the Cortex-M4F has no native FP16 support and falls back to the FP32 path at 64 MFLOPS.
- Energy: At 15 mW TDP over ~500ms, a single inference consumes roughly 8 millijoules -- acceptable for battery operation, but the latency is the problem, not energy.
- The gap: To meet 30ms, you would need either a faster MCU (e.g., Cortex-M7 at 480 MHz with DSP extensions) or a smaller model (fewer FLOPs). Even at efficiency=1.0, the raw compute time is 6M/128M = 47ms -- still above 30ms. The nRF52840 simply cannot meet this SLA for DS-CNN.
Discussion
This result surprises many students: a "tiny" 26K-parameter model still
cannot meet a 30ms SLA on a microcontroller. What does this teach about
the relationship between model size and latency? Try the ESP32-S3
(Hardware.Tiny.ESP32_S3) which has ~20x the compute throughput.
Does it meet 30ms? What if you needed to run MobileNetV2 -- try
Engine.solve(Models.Vision.MobileNetV2, hw, ...) and observe
the feasible field to see where the model hits the memory wall.
Exercise 8 — Capstone: Fleet Design Under Constraints (Parts 1-7)
Learning Objective: Design a complete serving fleet for Llama-3-70B at 1000 QPS, within a $5M annual budget, deployed across two regions.
Setup
from mlsysim import (
ServingModel, EconomicsModel, SustainabilityModel, CompressionModel,
Hardware, Models, Infra
)
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics
model = Models.Llama3_70B
serving = ServingModel()
econ = EconomicsModel()
sustain = SustainabilityModel()
compress = CompressionModel()
Task
Design a fleet that meets ALL constraints simultaneously:
- Throughput: 1000 QPS (queries per second) total across two regions
- Latency: ITL < 50ms per token
- Budget: < $5M annual TCO
- Carbon: < 500 tonnes CO2/year
- Regions: US East + Quebec (for redundancy)
Step 1: Determine per-GPU serving capacity.
# How many QPS per H100 at FP16?
r = serving.solve(model, Hardware.H100, seq_len=2048, batch_size=1, precision="fp16")
print(f"Single H100: TTFT={r.ttft:~P.1f} ITL={r.itl:~P.1f} feasible={r.feasible}")
# Try with INT4 compression
c = compress.solve(model, Hardware.H100, method="quantization", target_bitwidth=4)
print(f"INT4 compression: {c.compression_ratio:.1f}x accuracy_delta={c.accuracy_delta:.1f}%")
Step 2: Calculate fleet size needed for 1000 QPS.
# Estimate: each H100 can decode ~X tokens/sec for this model
# tokens_per_sec_per_gpu = 1 / itl_in_seconds (single-stream decode; batching raises this substantially)
itl_sec = r.itl.to("s").magnitude
tokens_per_sec = 1.0 / itl_sec if itl_sec > 0 else 0
print(f"Tokens/sec per GPU: {tokens_per_sec:.1f}")
print(f"GPUs needed for 1000 QPS: ~{1000 / max(tokens_per_sec, 1):.0f}")
Step 3: Check budget and carbon for each region split.
for us_pct in [0.3, 0.5, 0.7]:
    qc_pct = 1.0 - us_pct
    # (Adapt node count based on your QPS calculation above)
    n_total = 100  # placeholder -- replace with your calculation
    n_us = int(n_total * us_pct)
    n_qc = n_total - n_us
    fleet_us = Fleet(name="US East", node=Nodes.DGX_H100, count=max(1, n_us),
                     fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg)
    fleet_qc = Fleet(name="Quebec", node=Nodes.DGX_H100, count=max(1, n_qc),
                     fabric=Fabrics.InfiniBand_NDR, region=Infra.Quebec)
    tco_us = econ.solve(fleet_us, duration_days=365, mfu=0.3)
    tco_qc = econ.solve(fleet_qc, duration_days=365, mfu=0.3)
    co2_us = sustain.solve(fleet_us, duration_days=365, mfu=0.3)
    co2_qc = sustain.solve(fleet_qc, duration_days=365, mfu=0.3)
    total_tco = tco_us.tco_usd + tco_qc.tco_usd
    total_co2 = (co2_us.carbon_footprint_kg + co2_qc.carbon_footprint_kg) / 1000  # tonnes
    print(f"Split {us_pct:.0%} US / {qc_pct:.0%} QC: "
          f"TCO=${total_tco:,.0f} CO2={total_co2:.0f}t")
Question
What is your recommended fleet configuration? How many total GPUs, what regional split, and does INT4 quantization change your answer?
Hint
Llama-3-70B at FP16 requires ~140 GB -- it does not fit on a single H100 (80 GB). You must either use tensor parallelism (2+ GPUs per inference instance) or quantize to INT4 (~35 GB, fits on one H100). This fundamentally changes the fleet size calculation.
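A quick check of that constraint (70e9 parameters is the publicly reported model size; 80 GB is the H100's HBM capacity, and only weights are counted -- KV-cache and activations come on top):
params, hbm_gb = 70e9, 80
for label, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9   # weights only
    print(f"{label}: ~{gb:.0f} GB weights -> "
          f"{'fits on one H100' if gb < hbm_gb else 'needs TP or more GPUs'}")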
Expected Answer
This is an open-ended design exercise. A strong answer includes:
- Precision choice: INT4 quantization (35 GB) fits on a single H100, halving the GPU requirement vs. FP16 (which needs TP=2 minimum). The 2-5% accuracy trade-off is usually acceptable for serving.
- Fleet size: With INT4 and typical ITL of ~30-50ms per token, each H100 can handle roughly 20-30 QPS. For 1000 QPS total, you need approximately 35-50 GPUs (5-7 DGX H100 nodes).
- Regional split: Placing 70% of capacity in Quebec dramatically reduces carbon (hydro grid at ~1.2 gCO2/kWh vs. US average at ~390 gCO2/kWh) while keeping 30% in US East for latency-sensitive users. Total CO2 stays well under 500 tonnes/year.
- Budget check: 5-7 DGX H100 nodes at ~$200-300K each, plus electricity and networking, fits within the $5M annual budget.
The key insight is that compression is not just an optimization -- it is an architectural decision that changes the fleet design by 2x.
Discussion
What are the failure modes of this design? What happens when a full DGX node goes down in the Quebec region? How does the ReliabilityModel inform your redundancy strategy? Should you over-provision by N+1 or N+2?