# MLSys·im Tutorial: Hands-On Exercises
Eight exercises: one for each of the seven tutorial parts, plus a capstone that spans them all.
Each exercise is self-contained and takes 5-10 minutes.
---
## Exercise 1 — The Roofline Transition (Part 1: Single-Node Performance)
**Learning Objective:** Identify the batch size where a CNN workload transitions
from memory-bound to compute-bound on a datacenter GPU.
### Setup
```python
import mlsysim
from mlsysim import Engine, Hardware, Models
model = Models.ResNet50
hw = Hardware.A100
```
### Task
Sweep `batch_size` from 1 to 512 (powers of 2). For each, call
`Engine.solve()` and record the `bottleneck` field along with the
compute and memory latency components. Find the smallest batch size
where the compute term overtakes the memory term.
```python
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    p = Engine.solve(model, hw, batch_size=bs, precision="fp16", efficiency=0.5)
    print(f"bs={bs:>4d} bottleneck={p.bottleneck:<10s} "
          f"T_compute={p.latency_compute.to('ms'):~P.3f} "
          f"T_memory={p.latency_memory.to('ms'):~P.3f} "
          f"throughput={p.throughput:~P.1f}")
```
### Question
At what batch size does ResNet-50's compute time overtake its memory time
on the A100? Compare `latency_compute` vs. `latency_memory` to find the
exact crossover point.
### Hint
Watch for when `latency_compute > latency_memory`. The crossover is the
Roofline ridge point. Note that the reported `bottleneck` field depends
on the overall arithmetic intensity calculation, so inspect both latency
components directly.
### Expected Answer
ResNet-50 on the A100 at FP16 with efficiency=0.5 is already compute-bound
at very small batch sizes because of its high FLOP count (~8 GFLOPs per
image) relative to its small weight footprint (~51 MB at FP16). The compute
term scales linearly with batch size while the memory term grows more slowly
(weights loaded once, only activation traffic scales). For CNN workloads
like ResNet-50, the transition may already be at batch_size=1, unlike
Transformer decode which is memory-bound at batch_size=1. Compare with
`Models.Llama3_8B` to see a memory-bound regime.
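You can reproduce this crossover with a two-term pencil-and-paper model. A minimal sketch, using the constants from the answer above (~8 GFLOPs and ~51 MB of FP16 weights per image) and round A100 numbers (312 TFLOPS FP16, 2 TB/s HBM); the 25 MB per-image activation-traffic figure is an illustrative assumption, not a value from the `mlsysim` registry:
```python
# Two-term latency model: compute scales with batch size, while the
# memory term is weights (loaded once) plus per-image activation traffic.
PEAK_FLOPS = 312e12 * 0.5       # A100 FP16 tensor-core peak x efficiency=0.5
BANDWIDTH = 2.0e12              # A100 HBM2e, bytes/s
FLOPS_PER_IMG = 8e9             # ~8 GFLOPs per ResNet-50 image
WEIGHT_BYTES = 51e6             # ~51 MB of FP16 weights
ACT_BYTES_PER_IMG = 25e6        # assumed per-image activation traffic

for bs in [1, 2, 4, 8, 16]:
    t_c = bs * FLOPS_PER_IMG / PEAK_FLOPS
    t_m = (WEIGHT_BYTES + bs * ACT_BYTES_PER_IMG) / BANDWIDTH
    bound = "compute" if t_c > t_m else "memory"
    print(f"bs={bs:>3d}  T_compute={t_c*1e3:.3f} ms  T_memory={t_m*1e3:.3f} ms  -> {bound}")
```
Even at bs=1 the compute term already exceeds the memory term, which is why the crossover sits at the very start of the sweep.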
### Discussion
Why does this transition point matter for production deployment? Compare
ResNet-50 (compute-bound even at batch_size=1) with Llama-3-8B (memory-bound
at batch_size=1). What hardware characteristic should you optimize for
in each case -- peak FLOPS or memory bandwidth?
---
## Exercise 2 — LLM Serving Capacity (Part 2: Serving and Inference)
**Learning Objective:** Determine how many concurrent LLM requests fit in
GPU memory, accounting for both model weights and KV-cache.
### Setup
```python
from mlsysim import ServingModel, Hardware, Models
serving = ServingModel()
model = Models.Llama3_8B
hw = Hardware.H100
```
### Task
Sweep `batch_size` from 1 to 128 at `seq_len=4096`. Find the maximum
batch size where the serving result is still `feasible`.
```python
for bs in [1, 2, 4, 8, 16, 32, 64, 128]:
    r = serving.solve(model, hw, seq_len=4096, batch_size=bs, precision="fp16")
    print(f"bs={bs:>4d} feasible={r.feasible} "
          f"mem={r.total_memory_required:~P.1f} "
          f"kv_cache={r.kv_cache_size:~P.1f} "
          f"TTFT={r.ttft:~P.1f} ITL={r.itl:~P.2f}")
```
### Question
How many concurrent Llama-3-8B requests can a single H100 (80 GB) serve
at 4K context in FP16? What is the dominant memory consumer at that
maximum batch size -- weights or KV-cache?
### Hint
H100 has 80 GB HBM3. Llama-3-8B at FP16 is ~16 GB of weights.
Each request's KV-cache at 4096 tokens depends on the model's
layer count, head count, and head dimension.
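You can also pre-compute the capacity limit by hand before running the sweep. A minimal sketch, assuming Llama-3-8B's published architecture (32 layers, 8 KV heads under GQA, head dimension 128) and the weight and capacity figures from the hint:
```python
# KV-cache per request: K and V tensors for every layer, each shaped
# [kv_heads, seq_len, head_dim] at 2 bytes per FP16 element.
layers, kv_heads, head_dim = 32, 8, 128
seq_len, bytes_fp16 = 4096, 2
kv_per_request = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16
print(f"KV-cache per request: {kv_per_request / 2**30:.2f} GiB")  # ~0.50 GiB

capacity = 80 * 2**30   # 80 GiB of HBM3
weights = 16e9          # ~16 GB of FP16 weights
max_requests = int((capacity - weights) / kv_per_request)
print(f"Max concurrent requests: ~{max_requests}")  # ~130; bs=128 is the last power of two that fits
```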
### Expected Answer
The maximum concurrent batch size is approximately **128 requests** on
the H100 (80 GiB = ~85.9 GB). At FP16, Llama-3-8B weights consume ~16 GB.
Each request's KV-cache at 4096 tokens is ~0.5 GB, so after reserving
~16 GB for weights, the remaining ~70 GB fits roughly 128 KV-cache slots.
At maximum capacity, **KV-cache dominates** memory usage (~69 GB KV vs.
~16 GB weights). At bs=160, total memory exceeds capacity and becomes
infeasible.
### Discussion
What happens if you switch from FP16 to INT8 quantization? The weights
halve, but the KV-cache also shrinks. Which effect matters more for
concurrent serving capacity?
---
## Exercise 3 — Quantization Trade-offs (Part 3: Compression and Efficiency)
**Learning Objective:** Quantify the memory savings and inference speedup
from INT4 quantization, and understand the accuracy trade-off.
### Setup
```python
from mlsysim import CompressionModel, Engine, Hardware, Models
compress = CompressionModel()
model = Models.Llama3_8B
hw = Hardware.H100
```
### Task
Compare the FP16 baseline with INT4 quantization. Run the compression
model and then run `Engine.solve()` at both precisions.
```python
# Compression analysis
result = compress.solve(model, hw, method="quantization", target_bitwidth=4)
print(f"Compression ratio: {result.compression_ratio:.1f}x")
print(f"Original size: {result.original_size_gb:~P.2f}")
print(f"Compressed size: {result.compressed_size_gb:~P.2f}")
print(f"Memory savings: {result.memory_savings_pct:.1f}%")
print(f"Inference speedup: {result.inference_speedup:.2f}x")
print(f"Accuracy delta: {result.estimated_accuracy_delta:.2f}%")
# Roofline comparison
fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.5)
int4 = Engine.solve(model, hw, batch_size=1, precision="int4", efficiency=0.5)
print(f"\nFP16 latency: {fp16.latency:~P.2f} bottleneck: {fp16.bottleneck}")
print(f"INT4 latency: {int4.latency:~P.2f} bottleneck: {int4.bottleneck}")
```
### Question
What is the memory savings from FP32 baseline to INT4 for Llama-3-8B?
What is the estimated accuracy degradation? Is the speedup closer to
8x or 4x, and why?
### Hint
The `CompressionModel` measures compression ratio from the **FP32 baseline**
(32-bit), so INT4 gives an 8x ratio on paper (32/4). But inference speedup
depends on whether the workload is compute-bound or memory-bound. At
batch_size=1, LLM inference is memory-bound, so the speedup tracks with
the reduction in bytes moved, not FLOPS saved.
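The byte accounting is worth doing by hand once. A minimal sketch, assuming weight traffic dominates at batch_size=1 and ignoring KV-cache and activation bytes:
```python
params = 8e9  # Llama-3-8B parameter count
bits = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}
bytes_moved = {p: params * b / 8 for p, b in bits.items()}

for p, n in bytes_moved.items():
    print(f"{p}: {n / 1e9:.0f} GB weights")
# The reported compression ratio is measured against the FP32 baseline...
print(f"ratio vs fp32: {bytes_moved['fp32'] / bytes_moved['int4']:.0f}x")   # 8x
# ...but the practical serving speedup is relative to FP16.
print(f"speedup vs fp16: {bytes_moved['fp16'] / bytes_moved['int4']:.0f}x") # 4x
```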
### Expected Answer
- **Memory savings:** ~87.5% (8x compression from FP32 to INT4 baseline),
reducing the model from ~32 GB (FP32) to ~4 GB (INT4).
- **Accuracy delta:** Approximately 2-5% degradation (conservative estimate
from the Gholami et al. survey).
- **Inference speedup:** At batch_size=1, the workload is memory-bound, so
the speedup is roughly **proportional to the bytes reduction**. Compared to
FP16 inference (the practical baseline), the speedup from INT4 is ~4x in
memory traffic. The exact speedup depends on whether the hardware has
native INT4 execution units (B200 does, H100 does not).
### Discussion
When would you choose INT8 over INT4? Consider a deployment where you
need <1% accuracy loss but also need to fit the model on a single GPU.
What is the optimal compression point?
---
## Exercise 4 — Parallelism Strategy Search (Part 4: Distributed Training)
**Learning Objective:** Find the optimal 3D parallelism configuration
(TP x PP x DP) for training a 70B model on a 64-GPU cluster.
### Setup
```python
from mlsysim import ParallelismOptimizer, Models, Systems
optimizer = ParallelismOptimizer()
model = Models.Llama3_70B
fleet = Systems.Clusters.Research_256 # 256 H100s
```
### Task
Since we want 64 GPUs, build a custom fleet. Then run the optimizer.
```python
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics
fleet_64 = Fleet(
name="64-GPU H100 Cluster",
node=Nodes.DGX_H100,
count=8, # 8 nodes x 8 GPUs = 64
fabric=Fabrics.InfiniBand_NDR
)
result = optimizer.solve(
    model, fleet_64,
    batch_size=512,
    precision="fp16",
    efficiency=0.4,
    overlap_comm=True
)
print(f"Best config: {result.best_config}")
print(f"Best MFU: {result.best_mfu:.3f}")
print(f"Best step time: {result.best_step_time:~P.1f}")
print(f"Configs explored: {result.total_searched}")
# Print top candidates
for c in result.top_candidates[:10]:
    cfg = c['config']
    print(f" TP={cfg['tp']:>2d} PP={cfg['pp']:>2d} DP={cfg['dp']:>2d} "
          f"MFU={c['mfu']:.3f}")
```
### Question
What is the optimal TP x PP x DP split for Llama-3-70B on 64 H100s?
Why does the optimizer prefer TP=8 (one full node) for tensor parallelism?
### Hint
TP communication happens over NVLink (900 GB/s within a DGX H100 node).
PP communication is point-to-point (small volume). DP communication is
AllReduce over InfiniBand (across nodes). The optimizer balances memory
fit (TP+PP must shard the 70B model enough to fit) against communication
overhead.
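The search space is small enough to enumerate by hand, as sketched below. The 16 bytes-per-parameter training state is a standard mixed-precision-Adam assumption (2 B weights + 2 B gradients + 12 B optimizer/master state, with no ZeRO sharding across DP), not a value from the simulator, and activation memory is ignored:
```python
# Enumerate (TP, PP, DP) triples whose product is 64, keeping TP
# within the 8-GPU NVLink domain of one DGX H100 node.
n_gpus, hbm_gb = 64, 80
state_gb = 70e9 * 16 / 1e9  # assumed 16 B/param of training state

for tp in (1, 2, 4, 8):
    for pp in (1, 2, 4, 8):
        if n_gpus % (tp * pp):
            continue
        dp = n_gpus // (tp * pp)
        shard = state_gb / (tp * pp)  # model state is sharded by TP x PP only
        print(f"TP={tp} PP={pp} DP={dp}  per-GPU state ~{shard:,.0f} GB  fits={shard < hbm_gb}")
```
Under this crude filter, only triples with TP x PP >= 16 fit in 80 GB, which is why configurations like TP=8, PP=2, DP=4 survive.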
### Expected Answer
The optimizer typically finds **TP=8, PP=2, DP=4** or a nearby configuration.
- **TP=8** keeps tensor parallelism within a single DGX node (8 GPUs connected
by NVLink at 900 GB/s), minimizing communication latency for the 2 AllReduce
operations per layer.
- **PP=2** splits the 80 layers across 2 pipeline stages, reducing per-GPU
memory to fit in 80 GB HBM.
- **DP=4** provides data parallelism across 4 groups, allowing a reasonable
local batch size of 128 per DP rank.
The MFU is typically 0.30-0.45, reflecting the pipeline bubble and
communication overheads.
### Discussion
What happens if you double the cluster to 128 GPUs? Does MFU go up or
down? Which parallelism dimension should absorb the extra GPUs?
---
## Exercise 5 — Carbon Geography (Part 5: Sustainability)
**Learning Objective:** Quantify how datacenter location affects the
carbon footprint of a long training run.
### Setup
```python
from mlsysim import SustainabilityModel, Infra, Systems
sustain = SustainabilityModel()
fleet = Systems.Clusters.Research_256 # 256 H100s
```
### Task
Compare a 30-day training run in Virginia (US Average grid) versus
Quebec (hydroelectric).
```python
# Virginia (US Average)
r_va = sustain.solve(fleet, duration_days=30, datacenter=Infra.US_Avg, mfu=0.4)
print(f"=== Virginia (US Average Grid) ===")
print(f"Energy: {r_va.total_energy_kwh:,.0f} kWh")
print(f"Carbon: {r_va.carbon_footprint_kg:,.0f} kg CO2")
print(f"Water: {r_va.water_usage_liters:,.0f} liters")
# Quebec (Hydro)
r_qc = sustain.solve(fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4)
print(f"\n=== Quebec (Hydroelectric) ===")
print(f"Energy: {r_qc.total_energy_kwh:,.0f} kWh")
print(f"Carbon: {r_qc.carbon_footprint_kg:,.0f} kg CO2")
print(f"Water: {r_qc.water_usage_liters:,.0f} liters")
print(f"\n=== Comparison ===")
print(f"Carbon reduction: {(1 - r_qc.carbon_footprint_kg / r_va.carbon_footprint_kg) * 100:.1f}%")
print(f"Water reduction: {(1 - r_qc.water_usage_liters / r_va.water_usage_liters) * 100:.1f}%")
```
### Question
How much carbon (in kg CO2) does moving from Virginia to Quebec save for
a 30-day training run on 256 H100s? What fraction of the total energy
is consumed by cooling and power delivery (not compute)?
### Hint
Quebec's carbon intensity is ~1.2 gCO2/kWh (hydro) vs. US average
~390 gCO2/kWh (mixed grid). The PUE overhead is the ratio of total
facility energy to IT energy -- a PUE of 1.1 means 10% overhead.
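The whole comparison reduces to three multiplications. A minimal sketch, assuming ~900 W per GPU (including its share of host CPU, NIC, and fans -- an assumption, not a registry value) and the carbon intensities and PUEs quoted above:
```python
gpus, watts, days = 256, 900, 30
it_kwh = gpus * watts / 1000 * 24 * days  # IT energy over the run

for region, pue, gco2_per_kwh in [("Virginia", 1.2, 390), ("Quebec", 1.05, 1.2)]:
    total_kwh = it_kwh * pue              # add cooling/power-delivery overhead
    co2_kg = total_kwh * gco2_per_kwh / 1000
    print(f"{region:>8s}: {total_kwh:,.0f} kWh  {co2_kg:,.0f} kg CO2")
```
The energy totals are nearly identical; only the grid's carbon intensity separates the two regions.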
### Expected Answer
- **Energy:** Both regions consume similar total energy (~180,000-220,000 kWh
depending on PUE), since the GPUs draw the same power regardless of location.
- **Carbon:** Virginia produces roughly **80,000-90,000 kg CO2** while Quebec
produces approximately **200-300 kg CO2** -- a **~99% reduction**.
- **PUE overhead:** Quebec's liquid-cooled facility (PUE ~1.05) wastes ~5% on
infrastructure vs. US average air-cooled (PUE ~1.1-1.2) wasting 10-20%.
The carbon savings come entirely from the grid's energy source, not from
using less power. This is why datacenter location is the single highest-leverage
sustainability decision.
### Discussion
If you also factor in cost (electricity price per kWh), does Quebec remain
the optimal choice? Use `EconomicsModel` to find out. What about the
network latency penalty if your team is in California?
---
## Exercise 6 — Pareto-Optimal Serving (Part 6: Economics and Fleet Design)
**Learning Objective:** Given a fixed budget, find the serving configuration
that maximizes throughput while meeting latency SLAs.
### Setup
```python
from mlsysim import (
EconomicsModel, ServingModel, Hardware, Models,
Systems, Infra
)
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.systems.registry import Nodes, Fabrics
serving = ServingModel()
econ = EconomicsModel()
model = Models.Llama3_8B
```
### Task
Compare three hardware options for serving Llama-3-8B at 4K context,
each scaled to fit within a $1M annual budget.
```python
configs = [
("A100 cluster", Hardware.A100, Nodes.DGX_A100, 20), # ~20 nodes
("H100 cluster", Hardware.H100, Nodes.DGX_H100, 8), # ~8 nodes
("B200 cluster", Hardware.B200, Nodes.DGX_B200, 4), # ~4 nodes
]
for name, hw, node_template, n_nodes in configs:
    fleet = Fleet(
        name=name, node=node_template, count=n_nodes,
        fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg
    )
    # Economics: 365-day TCO
    tco = econ.solve(fleet, duration_days=365, mfu=0.3)
    # Serving: per-GPU capacity
    r = serving.solve(model, hw, seq_len=4096, batch_size=32, precision="fp16")
    total_gpus = fleet.total_accelerators
    print(f"\n=== {name} ({total_gpus} GPUs) ===")
    print(f" TCO: ${tco.tco_usd:,.0f}")
    print(f" TTFT: {r.ttft:~P.1f}")
    print(f" ITL: {r.itl:~P.2f}")
    print(f" Feasible: {r.feasible}")
    print(f" Per-GPU mem: {r.total_memory_required:~P.1f}")
```
### Question
With a $1M annual budget, which hardware generation provides the best
cost-per-request for Llama-3-8B serving? Is it always the newest GPU?
### Hint
Newer GPUs have higher unit cost but also higher bandwidth and FLOPS.
The Pareto-optimal choice depends on whether serving is memory-bound
(bandwidth matters) or compute-bound (FLOPS matter), and how many
GPUs you can afford.
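Because FP16 decode is memory-bound, a first-order ITL floor is just model bytes divided by bandwidth. A minimal sketch, assuming every weight is read once per generated token and ignoring KV-cache traffic:
```python
model_gb = 16  # Llama-3-8B at FP16
# GB divided by TB/s comes out directly in milliseconds.
for name, tbps in [("A100", 2.0), ("H100", 3.35), ("B200", 8.0)]:
    itl_ms = model_gb / tbps
    print(f"{name}: ITL floor ~{itl_ms:.1f} ms  ({1000 / itl_ms:,.0f} tokens/s)")
```
This is why bandwidth, not FLOPS, drives the per-GPU serving comparison at FP16.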
### Expected Answer
The answer depends on the specific unit costs in the registry, but the
general finding is:
- **A100:** Cheapest per-unit, so you get the most GPUs, but each has lower
bandwidth (2 TB/s vs. 3.35 TB/s for H100). Good for throughput-oriented
workloads where you can batch aggressively.
- **H100:** Best balance of cost and performance for LLM serving. Higher
bandwidth directly reduces ITL (inter-token latency) in the memory-bound
decode phase.
- **B200:** Highest per-unit cost but offers FP8/INT4 support and highest
bandwidth (~8 TB/s). Most cost-effective only if you can use lower precision.
The Pareto frontier typically shows **H100 as the sweet spot** for FP16
serving, with B200 winning if INT4/FP8 quantization is acceptable.
### Discussion
How does the analysis change if you add a latency SLA (e.g., ITL < 20ms)?
Does the Pareto-optimal choice shift when you constrain latency instead
of just minimizing cost?
---
## Exercise 7 — TinyML SLA Feasibility (Part 7: Edge and TinyML)
**Learning Objective:** Determine whether a keyword-spotting CNN can meet
a real-time SLA on a microcontroller.
### Setup
```python
from mlsysim import Engine, Models, ureg
from mlsysim.hardware.types import HardwareNode, ComputeCore, MemoryHierarchy
model = Models.Tiny.DS_CNN
# Construct the nRF52840 (Cortex-M4F @ 64 MHz) -- MLPerf Tiny reference platform
hw = HardwareNode(
name="Nordic nRF52840 (Cortex-M4F)",
release_year=2018,
compute=ComputeCore(
peak_flops=0.000064 * ureg.TFLOPs / ureg.s,
precision_flops={"int8": 0.000128 * ureg.TFLOPs / ureg.s},
),
memory=MemoryHierarchy(
capacity=1 * ureg.MB,
bandwidth=0.064 * ureg.GB / ureg.s,
sram_capacity=256 * ureg.KiB,
sram_bandwidth=0.256 * ureg.GB / ureg.s,
flash_capacity=1 * ureg.MB,
flash_bandwidth=0.064 * ureg.GB / ureg.s,
),
tdp=0.015 * ureg.W,
dispatch_tax=0.5 * ureg.ms,
)
```
### Task
Check if DS-CNN keyword spotting meets a 30ms latency SLA on the
nRF52840. Then explore what happens at different precisions and
efficiency levels.
```python
# Baseline: INT8 (native on Cortex-M4F)
p_int8 = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=0.3)
print(f"=== DS-CNN on nRF52840 (INT8) ===")
print(f"Latency: {p_int8.latency:~P.2f}")
print(f"Bottleneck: {p_int8.bottleneck}")
print(f"Memory: {p_int8.memory_footprint:~P.2f}")
print(f"Feasible: {p_int8.feasible}")
print(f"Energy: {p_int8.energy:~P.4f}")
print(f"Meets 30ms: {p_int8.latency.to('ms').magnitude < 30}")
# Compare: FP16 (no hardware support -- uses FP32 path)
p_fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.1)
print(f"\n=== DS-CNN on nRF52840 (FP16 emulated) ===")
print(f"Latency: {p_fp16.latency:~P.2f}")
print(f"Meets 30ms: {p_fp16.latency.to('ms').magnitude < 30}")
# Sweep efficiency to find the minimum required
for eff in [0.1, 0.2, 0.3, 0.4, 0.5]:
    p = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=eff)
    print(f"eff={eff:.1f} latency={p.latency:~P.2f} meets_30ms={p.latency.to('ms').magnitude < 30}")
```
### Question
Can DS-CNN meet a 30ms SLA on the nRF52840? If not, what is the binding
constraint, and what hardware or algorithmic change would close the gap?
### Hint
The nRF52840 has ~128 MOPS at INT8 and ~64 MFLOPS at FP32. DS-CNN
has ~6M FLOPs. Do the back-of-envelope math: 6M / (128M * efficiency).
Is this compute-bound or memory-bound?
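Here is the hint's arithmetic written out. A minimal sketch of the compute floor at each efficiency level; framework overhead (dispatch tax, layer tax) comes on top of these numbers:
```python
flops = 6e6       # DS-CNN forward pass, ~6M FLOPs
int8_ops = 128e6  # Cortex-M4F peak throughput at INT8

for eff in (0.1, 0.3, 0.5, 1.0):
    ms = flops / (int8_ops * eff) * 1e3
    print(f"eff={eff:.1f}: compute floor ~{ms:.0f} ms  meets 30ms: {ms < 30}")
```
Even the efficiency=1.0 floor (~47 ms) misses the SLA, so no software tuning on this MCU can close the gap.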
### Expected Answer
- **INT8 at efficiency=0.3:** DS-CNN inference takes approximately
**500ms** on the nRF52840 -- far exceeding the 30ms SLA. The workload is
**compute-bound**: 6M FLOPs / (128 MOPS * 0.3) ~ 156ms of raw compute,
plus framework overhead (dispatch tax, layer tax) pushes it past 500ms.
- **FP16 emulated:** Even slower (~1500ms+) because the Cortex-M4F has no
native FP16 support and falls back to the FP32 path at 64 MFLOPS.
- **Energy:** At 15 mW TDP over ~500ms, a single inference consumes roughly
**8 millijoules** -- acceptable for battery operation, but the latency
is the problem, not energy.
- **The gap:** To meet 30ms, you would need either a faster MCU (e.g.,
Cortex-M7 at 480 MHz with DSP extensions) or a smaller model (fewer FLOPs).
Even at efficiency=1.0, the raw compute time is 6M/128M = 47ms -- still
above 30ms. The nRF52840 simply cannot meet this SLA for DS-CNN.
### Discussion
This result surprises many students: a "tiny" 26K-parameter model still
cannot meet a 30ms SLA on a microcontroller. What does this teach about
the relationship between model size and latency? Try the ESP32-S3
(`Hardware.Tiny.ESP32_S3`) which has ~20x the compute throughput.
Does it meet 30ms? What if you needed to run MobileNetV2 -- try
`Engine.solve(Models.Vision.MobileNetV2, hw, ...)` and observe
the `feasible` field for the memory wall.
---
## Exercise 8 — Capstone: Fleet Design Under Constraints (Parts 1-7)
**Learning Objective:** Design a complete serving fleet for Llama-3-70B
at 1000 QPS, within a $5M annual budget, deployed across two regions.
### Setup
```python
from mlsysim import (
ServingModel, EconomicsModel, SustainabilityModel, CompressionModel,
Hardware, Models, Infra
)
from mlsysim.systems.types import Fleet
from mlsysim.systems.registry import Nodes, Fabrics
model = Models.Llama3_70B
serving = ServingModel()
econ = EconomicsModel()
sustain = SustainabilityModel()
compress = CompressionModel()
```
### Task
Design a fleet that meets ALL constraints simultaneously:
- **Throughput:** 1000 QPS (queries per second) total across two regions
- **Latency:** ITL < 50ms per token
- **Budget:** < $5M annual TCO
- **Carbon:** < 500 tonnes CO2/year
- **Regions:** US East + Quebec (for redundancy)
**Step 1:** Determine per-GPU serving capacity.
```python
# How many QPS per H100 at FP16?
r = serving.solve(model, Hardware.H100, seq_len=2048, batch_size=1, precision="fp16")
print(f"Single H100: TTFT={r.ttft:~P.1f} ITL={r.itl:~P.1f} feasible={r.feasible}")
# Try with INT4 compression
c = compress.solve(model, Hardware.H100, method="quantization", target_bitwidth=4)
print(f"INT4 compression: {c.compression_ratio:.1f}x accuracy_delta={c.accuracy_delta:.1f}%")
```
**Step 2:** Calculate fleet size needed for 1000 QPS.
```python
# Estimate: each H100 can decode ~X tokens/sec for this model
# tokens_per_sec_per_gpu = 1000 / itl_in_seconds
itl_sec = r.itl.to("s").magnitude
tokens_per_sec = 1.0 / itl_sec if itl_sec > 0 else 0
print(f"Tokens/sec per GPU: {tokens_per_sec:.1f}")
print(f"GPUs needed for 1000 QPS: ~{1000 / max(tokens_per_sec, 1):.0f}")
```
**Step 3:** Check budget and carbon for each region split.
```python
for us_pct in [0.3, 0.5, 0.7]:
    qc_pct = 1.0 - us_pct
    # (Adapt node count based on your QPS calculation above)
    n_total = 100  # placeholder -- replace with your calculation
    n_us = int(n_total * us_pct)
    n_qc = n_total - n_us
    fleet_us = Fleet(name="US East", node=Nodes.DGX_H100, count=max(1, n_us),
                     fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg)
    fleet_qc = Fleet(name="Quebec", node=Nodes.DGX_H100, count=max(1, n_qc),
                     fabric=Fabrics.InfiniBand_NDR, region=Infra.Quebec)
    tco_us = econ.solve(fleet_us, duration_days=365, mfu=0.3)
    tco_qc = econ.solve(fleet_qc, duration_days=365, mfu=0.3)
    co2_us = sustain.solve(fleet_us, duration_days=365, mfu=0.3)
    co2_qc = sustain.solve(fleet_qc, duration_days=365, mfu=0.3)
    total_tco = tco_us.tco_usd + tco_qc.tco_usd
    total_co2 = (co2_us.carbon_footprint_kg + co2_qc.carbon_footprint_kg) / 1000  # tonnes
    print(f"Split {us_pct:.0%} US / {qc_pct:.0%} QC: "
          f"TCO=${total_tco:,.0f} CO2={total_co2:.0f}t")
```
### Question
What is your recommended fleet configuration? How many total GPUs,
what regional split, and does INT4 quantization change your answer?
### Hint
Llama-3-70B at FP16 requires ~140 GB -- it does not fit on a single
H100 (80 GB). You must either use tensor parallelism (2+ GPUs per
inference instance) or quantize to INT4 (~35 GB, fits on one H100).
This fundamentally changes the fleet size calculation.
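Verify the memory-fit claim before sizing anything. A minimal sketch, counting weights only (KV-cache comes on top):
```python
params, hbm_gb = 70e9, 80
for prec, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    # int8 (70 GB) nominally fits but leaves almost no room for KV-cache.
    print(f"{prec}: {gb:.0f} GB weights -> fits one 80 GB H100: {gb < hbm_gb}")
```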
### Expected Answer
This is an open-ended design exercise. A strong answer includes:
1. **Precision choice:** INT4 quantization (35 GB) fits on a single H100,
halving the GPU requirement vs. FP16 (which needs TP=2 minimum).
The 2-5% accuracy trade-off is usually acceptable for serving.
2. **Fleet size:** With INT4 and typical ITL of ~30-50ms per token, each
H100 can handle roughly 20-30 QPS. For 1000 QPS total, you need
approximately 35-50 GPUs (5-7 DGX H100 nodes).
3. **Regional split:** Placing 70% of capacity in Quebec dramatically
reduces carbon (hydro grid at ~1.2 gCO2/kWh vs. US average at
~390 gCO2/kWh) while keeping 30% in US East for latency-sensitive
users. Total CO2 stays well under 500 tonnes/year.
4. **Budget check:** 5-7 DGX H100 nodes at ~$200-300K each, plus
electricity and networking, fits within the $5M annual budget.
The key insight is that **compression is not just an optimization -- it is
an architectural decision** that changes the fleet design by 2x.
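The fleet-size arithmetic behind point 2, as a sketch. It treats sustained decode tokens/s as a proxy for QPS, as the answer above does; the 40 ms ITL is an assumed mid-range value, not a simulator output:
```python
import math

itl_ms = 40                   # assumed mid-range ITL with INT4
qps_per_gpu = 1000 / itl_ms   # ~25 tokens (queries) per second per GPU
gpus = math.ceil(1000 / qps_per_gpu)
nodes = math.ceil(gpus / 8)   # 8 GPUs per DGX H100
print(f"{gpus} GPUs -> {nodes} DGX H100 nodes")  # 40 GPUs -> 5 nodes
```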
### Discussion
What are the failure modes of this design? What happens when a full
DGX node goes down in the Quebec region? How does the ReliabilityModel
inform your redundancy strategy? Should you over-provision by N+1 or N+2?