mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-06 17:49:07 -05:00
Align the MLSys·im code, docs, paper, website, workflows, and lab wheel for the 0.1.1 release. This also fixes runtime/API issues found during release review and prepares the paper PDF plus archive package.
694 lines
24 KiB
Markdown
694 lines
24 KiB
Markdown
# MLSys·im Tutorial: Hands-On Exercises
|
|
|
|
Eight exercises aligned with the eight tutorial parts.
|
|
Each exercise is self-contained and takes 5-10 minutes.
|
|
|
|
---
|
|
|
|
## Exercise 1 — The Roofline Transition (Part 1: Single-Node Performance)
|
|
|
|
**Learning Objective:** Identify the batch size where a CNN workload transitions
|
|
from memory-bound to compute-bound on a datacenter GPU.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
import mlsysim
|
|
from mlsysim import Engine, Hardware, Models
|
|
|
|
model = Models.ResNet50
|
|
hw = Hardware.A100
|
|
```
|
|
|
|
### Task
|
|
|
|
Sweep `batch_size` from 1 to 512 (powers of 2). For each, call
|
|
`Engine.solve()` and record the `bottleneck` field along with the
|
|
compute and memory latency components. Find the smallest batch size
|
|
where the compute term overtakes the memory term.
|
|
|
|
```python
|
|
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
|
|
p = Engine.solve(model, hw, batch_size=bs, precision="fp16", efficiency=0.5)
|
|
print(f"bs={bs:>4d} bottleneck={p.bottleneck:<10s} "
|
|
f"T_compute={p.latency_compute.to('ms'):~P.3f} "
|
|
f"T_memory={p.latency_memory.to('ms'):~P.3f} "
|
|
f"throughput={p.throughput:~P.1f}")
|
|
```
|
|
|
|
### Question
|
|
|
|
At what batch size does ResNet-50's compute time overtake its memory time
|
|
on the A100? Compare `latency_compute` vs. `latency_memory` to find the
|
|
exact crossover point.
|
|
|
|
### Hint
|
|
|
|
Watch for when `latency_compute > latency_memory`. The crossover is the
|
|
Roofline ridge point. Note that the reported `bottleneck` field depends
|
|
on the overall arithmetic intensity calculation, so inspect both latency
|
|
components directly.
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
ResNet-50 on the A100 at FP16 with efficiency=0.5 is already compute-bound
|
|
at very small batch sizes because of its high FLOP count (~8 GFLOPs per
|
|
image) relative to its small weight footprint (~51 MB at FP16). The compute
|
|
term scales linearly with batch size while the memory term grows more slowly
|
|
(weights loaded once, only activation traffic scales). For CNN workloads
|
|
like ResNet-50, the transition may already be at batch_size=1, unlike
|
|
Transformer decode which is memory-bound at batch_size=1. Compare with
|
|
`Models.Llama3_8B` to see a memory-bound regime.
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
Why does this transition point matter for production deployment? Compare
|
|
ResNet-50 (compute-bound even at batch_size=1) with Llama-3-8B (memory-bound
|
|
at batch_size=1). What hardware characteristic should you optimize for
|
|
in each case -- peak FLOPS or memory bandwidth?
|
|
|
|
---
|
|
|
|
## Exercise 2 — LLM Serving Capacity (Part 2: Serving and Inference)
|
|
|
|
**Learning Objective:** Determine how many concurrent LLM requests fit in
|
|
GPU memory, accounting for both model weights and KV-cache.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
from mlsysim import ServingModel, Hardware, Models
|
|
|
|
serving = ServingModel()
|
|
model = Models.Llama3_8B
|
|
hw = Hardware.H100
|
|
```
|
|
|
|
### Task
|
|
|
|
Sweep `batch_size` from 1 to 128 at `seq_len=4096`. Find the maximum
|
|
batch size where the serving result is still `feasible`.
|
|
|
|
```python
|
|
for bs in [1, 2, 4, 8, 16, 32, 64, 128]:
|
|
r = serving.solve(model, hw, seq_len=4096, batch_size=bs, precision="fp16")
|
|
print(f"bs={bs:>4d} feasible={r.feasible} "
|
|
f"mem={r.total_memory_required:~P.1f} "
|
|
f"kv_cache={r.kv_cache_size:~P.1f} "
|
|
f"TTFT={r.ttft:~P.1f} ITL={r.itl:~P.2f}")
|
|
```
|
|
|
|
### Question
|
|
|
|
How many concurrent Llama-3-8B requests can a single H100 (80 GB) serve
|
|
at 4K context in FP16? What is the dominant memory consumer at that
|
|
maximum batch size -- weights or KV-cache?
|
|
|
|
### Hint
|
|
|
|
H100 has 80 GB HBM3. Llama-3-8B at FP16 is ~16 GB of weights.
|
|
Each request's KV-cache at 4096 tokens depends on the model's
|
|
layer count, head count, and head dimension.
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
The maximum concurrent batch size is approximately **128 requests** on
|
|
the H100 (80 GiB = ~85.9 GB). At FP16, Llama-3-8B weights consume ~16 GB.
|
|
Each request's KV-cache at 4096 tokens is ~0.5 GB, so after reserving
|
|
~16 GB for weights, the remaining ~70 GB fits roughly 128 KV-cache slots.
|
|
At maximum capacity, **KV-cache dominates** memory usage (~69 GB KV vs.
|
|
~16 GB weights). At bs=160, total memory exceeds capacity and becomes
|
|
infeasible.
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
What happens if you switch from FP16 to INT8 quantization? The weights
|
|
halve, but the KV-cache also shrinks. Which effect matters more for
|
|
concurrent serving capacity?
|
|
|
|
---
|
|
|
|
## Exercise 3 — Quantization Trade-offs (Part 3: Compression and Efficiency)
|
|
|
|
**Learning Objective:** Quantify the memory savings and inference speedup
|
|
from INT4 quantization, and understand the accuracy trade-off.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
from mlsysim import CompressionModel, Engine, Hardware, Models
|
|
|
|
compress = CompressionModel()
|
|
model = Models.Llama3_8B
|
|
hw = Hardware.H100
|
|
```
|
|
|
|
### Task
|
|
|
|
Compare the FP16 baseline with INT4 quantization. Run the compression
|
|
model and then run `Engine.solve()` at both precisions.
|
|
|
|
```python
|
|
# Compression analysis
|
|
result = compress.solve(model, hw, method="quantization", target_bitwidth=4)
|
|
print(f"Compression ratio: {result.compression_ratio:.1f}x")
|
|
print(f"Original size: {result.original_size_gb:~P.2f}")
|
|
print(f"Compressed size: {result.compressed_size_gb:~P.2f}")
|
|
print(f"Memory savings: {result.memory_savings_pct:.1f}%")
|
|
print(f"Inference speedup: {result.inference_speedup:.2f}x")
|
|
print(f"Accuracy delta: {result.estimated_accuracy_delta:.2f}%")
|
|
|
|
# Roofline comparison
|
|
fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.5)
|
|
int4 = Engine.solve(model, hw, batch_size=1, precision="int4", efficiency=0.5)
|
|
print(f"\nFP16 latency: {fp16.latency:~P.2f} bottleneck: {fp16.bottleneck}")
|
|
print(f"INT4 latency: {int4.latency:~P.2f} bottleneck: {int4.bottleneck}")
|
|
```
|
|
|
|
### Question
|
|
|
|
What is the memory savings from FP32 baseline to INT4 for Llama-3-8B?
|
|
What is the estimated accuracy degradation? Is the speedup closer to
|
|
8x or 4x, and why?
|
|
|
|
### Hint
|
|
|
|
The `CompressionModel` measures compression ratio from the **FP32 baseline**
|
|
(32-bit), so INT4 gives an 8x ratio on paper (32/4). But inference speedup
|
|
depends on whether the workload is compute-bound or memory-bound. At
|
|
batch_size=1, LLM inference is memory-bound, so the speedup tracks with
|
|
the reduction in bytes moved, not FLOPS saved.
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
- **Memory savings:** ~87.5% (8x compression from FP32 to INT4 baseline),
|
|
reducing the model from ~32 GB (FP32) to ~4 GB (INT4).
|
|
- **Accuracy delta:** Approximately 2-5% degradation (conservative estimate
|
|
from the Gholami et al. survey).
|
|
- **Inference speedup:** At batch_size=1, the workload is memory-bound, so
|
|
the speedup is roughly **proportional to the bytes reduction**. Compared to
|
|
FP16 inference (the practical baseline), the speedup from INT4 is ~4x in
|
|
memory traffic. The exact speedup depends on whether the hardware has
|
|
native INT4 execution units (B200 does, H100 does not).
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
When would you choose INT8 over INT4? Consider a deployment where you
|
|
need <1% accuracy loss but also need to fit the model on a single GPU.
|
|
What is the optimal compression point?
|
|
|
|
---
|
|
|
|
## Exercise 4 — Parallelism Strategy Search (Part 4: Distributed Training)
|
|
|
|
**Learning Objective:** Find the optimal 3D parallelism configuration
|
|
(TP x PP x DP) for training a 70B model on a 64-GPU cluster.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
from mlsysim import ParallelismOptimizer, Models, Systems
|
|
|
|
optimizer = ParallelismOptimizer()
|
|
model = Models.Llama3_70B
|
|
fleet = Systems.Clusters.Research_256 # 256 H100s
|
|
```
|
|
|
|
### Task
|
|
|
|
Since we want 64 GPUs, build a custom fleet. Then run the optimizer.
|
|
|
|
```python
|
|
from mlsysim.systems.types import Fleet
|
|
from mlsysim.systems.registry import Nodes, Fabrics
|
|
|
|
fleet_64 = Fleet(
|
|
name="64-GPU H100 Cluster",
|
|
node=Nodes.DGX_H100,
|
|
count=8, # 8 nodes x 8 GPUs = 64
|
|
fabric=Fabrics.InfiniBand_NDR
|
|
)
|
|
|
|
result = optimizer.solve(
|
|
model, fleet_64,
|
|
batch_size=512,
|
|
precision="fp16",
|
|
efficiency=0.4,
|
|
overlap_comm=True
|
|
)
|
|
|
|
print(f"Best config: {result.best_config}")
|
|
print(f"Best MFU: {result.best_mfu:.3f}")
|
|
print(f"Best step time: {result.best_step_time:~P.1f}")
|
|
print(f"Configs explored: {result.total_searched}")
|
|
|
|
# Print top candidates
|
|
for c in result.top_candidates[:10]:
|
|
cfg = c['config']
|
|
print(f" TP={cfg['tp']:>2d} PP={cfg['pp']:>2d} DP={cfg['dp']:>2d} "
|
|
f"MFU={c['mfu']:.3f}")
|
|
```
|
|
|
|
### Question
|
|
|
|
What is the optimal TP x PP x DP split for Llama-3-70B on 64 H100s?
|
|
Why does the optimizer prefer TP=8 (one full node) for tensor parallelism?
|
|
|
|
### Hint
|
|
|
|
TP communication happens over NVLink (900 GB/s within a DGX H100 node).
|
|
PP communication is point-to-point (small volume). DP communication is
|
|
AllReduce over InfiniBand (across nodes). The optimizer balances memory
|
|
fit (TP+PP must shard the 70B model enough to fit) against communication
|
|
overhead.
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
The optimizer typically finds **TP=8, PP=2, DP=4** or a nearby configuration.
|
|
|
|
- **TP=8** keeps tensor parallelism within a single DGX node (8 GPUs connected
|
|
by NVLink at 900 GB/s), minimizing communication latency for the 2 AllReduce
|
|
operations per layer.
|
|
- **PP=2** splits the 80 layers across 2 pipeline stages, reducing per-GPU
|
|
memory to fit in 80 GB HBM.
|
|
- **DP=4** provides data parallelism across 4 groups, allowing a reasonable
|
|
local batch size of 128 per DP rank.
|
|
|
|
The MFU is typically 0.30-0.45, reflecting the pipeline bubble and
|
|
communication overheads.
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
What happens if you double the cluster to 128 GPUs? Does MFU go up or
|
|
down? Which parallelism dimension should absorb the extra GPUs?
|
|
|
|
---
|
|
|
|
## Exercise 5 — Carbon Geography (Part 5: Sustainability)
|
|
|
|
**Learning Objective:** Quantify how datacenter location affects the
|
|
carbon footprint of a long training run.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
from mlsysim import SustainabilityModel, Infra, Systems
|
|
|
|
sustain = SustainabilityModel()
|
|
fleet = Systems.Clusters.Research_256 # 256 H100s
|
|
```
|
|
|
|
### Task
|
|
|
|
Compare a 30-day training run in Virginia (US Average grid) versus
|
|
Quebec (hydroelectric).
|
|
|
|
```python
|
|
# Virginia (US Average)
|
|
r_va = sustain.solve(fleet, duration_days=30, datacenter=Infra.US_Avg, mfu=0.4)
|
|
print(f"=== Virginia (US Average Grid) ===")
|
|
print(f"Energy: {r_va.total_energy_kwh:,.0f} kWh")
|
|
print(f"Carbon: {r_va.carbon_footprint_kg:,.0f} kg CO2")
|
|
print(f"Water: {r_va.water_usage_liters:,.0f} liters")
|
|
|
|
# Quebec (Hydro)
|
|
r_qc = sustain.solve(fleet, duration_days=30, datacenter=Infra.Quebec, mfu=0.4)
|
|
print(f"\n=== Quebec (Hydroelectric) ===")
|
|
print(f"Energy: {r_qc.total_energy_kwh:,.0f} kWh")
|
|
print(f"Carbon: {r_qc.carbon_footprint_kg:,.0f} kg CO2")
|
|
print(f"Water: {r_qc.water_usage_liters:,.0f} liters")
|
|
|
|
print(f"\n=== Comparison ===")
|
|
print(f"Carbon reduction: {(1 - r_qc.carbon_footprint_kg / r_va.carbon_footprint_kg) * 100:.1f}%")
|
|
print(f"Water reduction: {(1 - r_qc.water_usage_liters / r_va.water_usage_liters) * 100:.1f}%")
|
|
```
|
|
|
|
### Question
|
|
|
|
How much carbon (in kg CO2) does moving from Virginia to Quebec save for
|
|
a 30-day training run on 256 H100s? What fraction of the total energy
|
|
is consumed by cooling and power delivery (not compute)?
|
|
|
|
### Hint
|
|
|
|
Quebec's carbon intensity is ~1.2 gCO2/kWh (hydro) vs. US average
|
|
~390 gCO2/kWh (mixed grid). The PUE overhead is the ratio of total
|
|
facility energy to IT energy -- a PUE of 1.1 means 10% overhead.
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
- **Energy:** Both regions consume similar total energy (~180,000-220,000 kWh
|
|
depending on PUE), since the GPUs draw the same power regardless of location.
|
|
- **Carbon:** Virginia produces roughly **80,000-90,000 kg CO2** while Quebec
|
|
produces approximately **200-300 kg CO2** -- a **~99% reduction**.
|
|
- **PUE overhead:** Quebec's liquid-cooled facility (PUE ~1.05) wastes ~5% on
|
|
infrastructure vs. US average air-cooled (PUE ~1.1-1.2) wasting 10-20%.
|
|
|
|
The carbon savings come entirely from the grid's energy source, not from
|
|
using less power. This is why datacenter location is the single highest-leverage
|
|
sustainability decision.
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
If you also factor in cost (electricity price per kWh), does Quebec remain
|
|
the optimal choice? Use `EconomicsModel` to find out. What about the
|
|
network latency penalty if your team is in California?
|
|
|
|
---
|
|
|
|
## Exercise 6 — Pareto-Optimal Serving (Part 6: Economics and Fleet Design)
|
|
|
|
**Learning Objective:** Given a fixed budget, find the serving configuration
|
|
that maximizes throughput while meeting latency SLAs.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
from mlsysim import (
|
|
EconomicsModel, ServingModel, Hardware, Models,
|
|
Systems, Infra
|
|
)
|
|
from mlsysim.systems.types import Fleet, Node, NetworkFabric
|
|
from mlsysim.systems.registry import Nodes, Fabrics
|
|
|
|
serving = ServingModel()
|
|
econ = EconomicsModel()
|
|
model = Models.Llama3_8B
|
|
```
|
|
|
|
### Task
|
|
|
|
Compare three hardware options for serving Llama-3-8B at 4K context,
|
|
each scaled to fit within a $1M annual budget.
|
|
|
|
```python
|
|
configs = [
|
|
("A100 cluster", Hardware.A100, Nodes.DGX_A100, 20), # ~20 nodes
|
|
("H100 cluster", Hardware.H100, Nodes.DGX_H100, 8), # ~8 nodes
|
|
("B200 cluster", Hardware.B200, Nodes.DGX_B200, 4), # ~4 nodes
|
|
]
|
|
|
|
for name, hw, node_template, n_nodes in configs:
|
|
fleet = Fleet(
|
|
name=name, node=node_template, count=n_nodes,
|
|
fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg
|
|
)
|
|
# Economics: 365-day TCO
|
|
tco = econ.solve(fleet, duration_days=365, mfu=0.3)
|
|
|
|
# Serving: per-GPU capacity
|
|
r = serving.solve(model, hw, seq_len=4096, batch_size=32, precision="fp16")
|
|
|
|
total_gpus = fleet.total_accelerators
|
|
print(f"\n=== {name} ({total_gpus} GPUs) ===")
|
|
print(f" TCO: ${tco.tco_usd:,.0f}")
|
|
print(f" TTFT: {r.ttft:~P.1f}")
|
|
print(f" ITL: {r.itl:~P.2f}")
|
|
print(f" Feasible: {r.feasible}")
|
|
print(f" Per-GPU mem: {r.total_memory_required:~P.1f}")
|
|
```
|
|
|
|
### Question
|
|
|
|
With a $1M annual budget, which hardware generation provides the best
|
|
cost-per-request for Llama-3-8B serving? Is it always the newest GPU?
|
|
|
|
### Hint
|
|
|
|
Newer GPUs have higher unit cost but also higher bandwidth and FLOPS.
|
|
The Pareto-optimal choice depends on whether serving is memory-bound
|
|
(bandwidth matters) or compute-bound (FLOPS matter), and how many
|
|
GPUs you can afford.
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
The answer depends on the specific unit costs in the registry, but the
|
|
general finding is:
|
|
|
|
- **A100:** Cheapest per-unit, so you get the most GPUs, but each has lower
|
|
bandwidth (2 TB/s vs. 3.35 TB/s for H100). Good for throughput-oriented
|
|
workloads where you can batch aggressively.
|
|
- **H100:** Best balance of cost and performance for LLM serving. Higher
|
|
bandwidth directly reduces ITL (inter-token latency) in the memory-bound
|
|
decode phase.
|
|
- **B200:** Highest per-unit cost but offers FP8/INT4 support and highest
|
|
bandwidth (~8 TB/s). Most cost-effective only if you can use lower precision.
|
|
|
|
The Pareto frontier typically shows **H100 as the sweet spot** for FP16
|
|
serving, with B200 winning if INT4/FP8 quantization is acceptable.
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
How does the analysis change if you add a latency SLA (e.g., ITL < 20ms)?
|
|
Does the Pareto-optimal choice shift when you constrain latency instead
|
|
of just minimizing cost?
|
|
|
|
---
|
|
|
|
## Exercise 7 — TinyML SLA Feasibility (Part 7: Edge and TinyML)
|
|
|
|
**Learning Objective:** Determine whether a keyword-spotting CNN can meet
|
|
a real-time SLA on a microcontroller.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
from mlsysim import Engine, Models, ureg
|
|
from mlsysim.hardware.types import HardwareNode, ComputeCore, MemoryHierarchy
|
|
|
|
model = Models.Tiny.DS_CNN
|
|
|
|
# Construct the nRF52840 (Cortex-M4F @ 64 MHz) -- MLPerf Tiny reference platform
|
|
hw = HardwareNode(
|
|
name="Nordic nRF52840 (Cortex-M4F)",
|
|
release_year=2018,
|
|
compute=ComputeCore(
|
|
peak_flops=0.000064 * ureg.TFLOPs / ureg.s,
|
|
precision_flops={"int8": 0.000128 * ureg.TFLOPs / ureg.s},
|
|
),
|
|
memory=MemoryHierarchy(
|
|
capacity=1 * ureg.MB,
|
|
bandwidth=0.064 * ureg.GB / ureg.s,
|
|
sram_capacity=256 * ureg.KiB,
|
|
sram_bandwidth=0.256 * ureg.GB / ureg.s,
|
|
flash_capacity=1 * ureg.MB,
|
|
flash_bandwidth=0.064 * ureg.GB / ureg.s,
|
|
),
|
|
tdp=0.015 * ureg.W,
|
|
dispatch_tax=0.5 * ureg.ms,
|
|
)
|
|
```
|
|
|
|
### Task
|
|
|
|
Check if DS-CNN keyword spotting meets a 30ms latency SLA on the
|
|
nRF52840. Then explore what happens at different precisions and
|
|
efficiency levels.
|
|
|
|
```python
|
|
# Baseline: INT8 (native on Cortex-M4F)
|
|
p_int8 = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=0.3)
|
|
print(f"=== DS-CNN on nRF52840 (INT8) ===")
|
|
print(f"Latency: {p_int8.latency:~P.2f}")
|
|
print(f"Bottleneck: {p_int8.bottleneck}")
|
|
print(f"Memory: {p_int8.memory_footprint:~P.2f}")
|
|
print(f"Feasible: {p_int8.feasible}")
|
|
print(f"Energy: {p_int8.energy:~P.4f}")
|
|
print(f"Meets 30ms: {p_int8.latency.to('ms').magnitude < 30}")
|
|
|
|
# Compare: FP16 (no hardware support -- uses FP32 path)
|
|
p_fp16 = Engine.solve(model, hw, batch_size=1, precision="fp16", efficiency=0.1)
|
|
print(f"\n=== DS-CNN on nRF52840 (FP16 emulated) ===")
|
|
print(f"Latency: {p_fp16.latency:~P.2f}")
|
|
print(f"Meets 30ms: {p_fp16.latency.to('ms').magnitude < 30}")
|
|
|
|
# Sweep efficiency to find the minimum required
|
|
for eff in [0.1, 0.2, 0.3, 0.4, 0.5]:
|
|
p = Engine.solve(model, hw, batch_size=1, precision="int8", efficiency=eff)
|
|
print(f"eff={eff:.1f} latency={p.latency:~P.2f} meets_30ms={p.latency.to('ms').magnitude < 30}")
|
|
```
|
|
|
|
### Question
|
|
|
|
Can DS-CNN meet a 30ms SLA on the nRF52840? If not, what is the binding
|
|
constraint, and what hardware or algorithmic change would close the gap?
|
|
|
|
### Hint
|
|
|
|
The nRF52840 has ~128 MOPS at INT8 and ~64 MFLOPS at FP32. DS-CNN
|
|
has ~6M FLOPs. Do the back-of-envelope math: 6M / (128M * efficiency).
|
|
Is this compute-bound or memory-bound?
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
- **INT8 at efficiency=0.3:** DS-CNN inference takes approximately
|
|
**500ms** on the nRF52840 -- far exceeding the 30ms SLA. The workload is
|
|
**compute-bound**: 6M FLOPs / (128 MOPS * 0.3) ~ 156ms of raw compute,
|
|
plus framework overhead (dispatch tax, layer tax) pushes it past 500ms.
|
|
- **FP16 emulated:** Even slower (~1500ms+) because the Cortex-M4F has no
|
|
native FP16 support and falls back to the FP32 path at 64 MFLOPS.
|
|
- **Energy:** At 15 mW TDP over ~500ms, a single inference consumes roughly
|
|
**8 millijoules** -- acceptable for battery operation, but the latency
|
|
is the problem, not energy.
|
|
- **The gap:** To meet 30ms, you would need either a faster MCU (e.g.,
|
|
Cortex-M7 at 480 MHz with DSP extensions) or a smaller model (fewer FLOPs).
|
|
Even at efficiency=1.0, the raw compute time is 6M/128M = 47ms -- still
|
|
above 30ms. The nRF52840 simply cannot meet this SLA for DS-CNN.
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
This result surprises many students: a "tiny" 26K-parameter model still
|
|
cannot meet a 30ms SLA on a microcontroller. What does this teach about
|
|
the relationship between model size and latency? Try the ESP32-S3
|
|
(`Hardware.Tiny.ESP32_S3`) which has ~20x the compute throughput.
|
|
Does it meet 30ms? What if you needed to run MobileNetV2 -- try
|
|
`Engine.solve(Models.Vision.MobileNetV2, hw, ...)` and observe
|
|
the `feasible` field for the memory wall.
|
|
|
|
---
|
|
|
|
## Exercise 8 — Capstone: Fleet Design Under Constraints (Parts 1-7)
|
|
|
|
**Learning Objective:** Design a complete serving fleet for Llama-3-70B
|
|
at 1000 QPS, within a $5M annual budget, deployed across two regions.
|
|
|
|
### Setup
|
|
|
|
```python
|
|
from mlsysim import (
|
|
ServingModel, EconomicsModel, SustainabilityModel, CompressionModel,
|
|
Hardware, Models, Infra
|
|
)
|
|
from mlsysim.systems.types import Fleet
|
|
from mlsysim.systems.registry import Nodes, Fabrics
|
|
|
|
model = Models.Llama3_70B
|
|
serving = ServingModel()
|
|
econ = EconomicsModel()
|
|
sustain = SustainabilityModel()
|
|
compress = CompressionModel()
|
|
```
|
|
|
|
### Task
|
|
|
|
Design a fleet that meets ALL constraints simultaneously:
|
|
- **Throughput:** 1000 QPS (queries per second) total across two regions
|
|
- **Latency:** ITL < 50ms per token
|
|
- **Budget:** < $5M annual TCO
|
|
- **Carbon:** < 500 tonnes CO2/year
|
|
- **Regions:** US East + Quebec (for redundancy)
|
|
|
|
**Step 1:** Determine per-GPU serving capacity.
|
|
|
|
```python
|
|
# How many QPS per H100 at FP16?
|
|
r = serving.solve(model, Hardware.H100, seq_len=2048, batch_size=1, precision="fp16")
|
|
print(f"Single H100: TTFT={r.ttft:~P.1f} ITL={r.itl:~P.1f} feasible={r.feasible}")
|
|
|
|
# Try with INT4 compression
|
|
c = compress.solve(model, Hardware.H100, method="quantization", target_bitwidth=4)
|
|
print(f"INT4 compression: {c.compression_ratio:.1f}x accuracy_delta={c.accuracy_delta:.1f}%")
|
|
```
|
|
|
|
**Step 2:** Calculate fleet size needed for 1000 QPS.
|
|
|
|
```python
|
|
# Estimate: each H100 can decode ~X tokens/sec for this model
|
|
# tokens_per_sec_per_gpu = 1000 / itl_in_seconds
|
|
itl_sec = r.itl.to("s").magnitude
|
|
tokens_per_sec = 1.0 / itl_sec if itl_sec > 0 else 0
|
|
print(f"Tokens/sec per GPU: {tokens_per_sec:.1f}")
|
|
print(f"GPUs needed for 1000 QPS: ~{1000 / max(tokens_per_sec, 1):.0f}")
|
|
```
|
|
|
|
**Step 3:** Check budget and carbon for each region split.
|
|
|
|
```python
|
|
for us_pct in [0.3, 0.5, 0.7]:
|
|
qc_pct = 1.0 - us_pct
|
|
# (Adapt node count based on your QPS calculation above)
|
|
n_total = 100 # placeholder -- replace with your calculation
|
|
n_us = int(n_total * us_pct)
|
|
n_qc = n_total - n_us
|
|
|
|
fleet_us = Fleet(name="US East", node=Nodes.DGX_H100, count=max(1,n_us),
|
|
fabric=Fabrics.InfiniBand_NDR, region=Infra.US_Avg)
|
|
fleet_qc = Fleet(name="Quebec", node=Nodes.DGX_H100, count=max(1,n_qc),
|
|
fabric=Fabrics.InfiniBand_NDR, region=Infra.Quebec)
|
|
|
|
tco_us = econ.solve(fleet_us, duration_days=365, mfu=0.3)
|
|
tco_qc = econ.solve(fleet_qc, duration_days=365, mfu=0.3)
|
|
co2_us = sustain.solve(fleet_us, duration_days=365, mfu=0.3)
|
|
co2_qc = sustain.solve(fleet_qc, duration_days=365, mfu=0.3)
|
|
|
|
total_tco = tco_us.tco_usd + tco_qc.tco_usd
|
|
total_co2 = (co2_us.carbon_footprint_kg + co2_qc.carbon_footprint_kg) / 1000 # tonnes
|
|
print(f"Split {us_pct:.0%} US / {qc_pct:.0%} QC: "
|
|
f"TCO=${total_tco:,.0f} CO2={total_co2:.0f}t")
|
|
```
|
|
|
|
### Question
|
|
|
|
What is your recommended fleet configuration? How many total GPUs,
|
|
what regional split, and does INT4 quantization change your answer?
|
|
|
|
### Hint
|
|
|
|
Llama-3-70B at FP16 requires ~140 GB -- it does not fit on a single
|
|
H100 (80 GB). You must either use tensor parallelism (2+ GPUs per
|
|
inference instance) or quantize to INT4 (~35 GB, fits on one H100).
|
|
This fundamentally changes the fleet size calculation.
|
|
|
|
<details>
|
|
<summary><strong>Expected Answer</strong></summary>
|
|
|
|
This is an open-ended design exercise. A strong answer includes:
|
|
|
|
1. **Precision choice:** INT4 quantization (35 GB) fits on a single H100,
|
|
halving the GPU requirement vs. FP16 (which needs TP=2 minimum).
|
|
The 2-5% accuracy trade-off is usually acceptable for serving.
|
|
|
|
2. **Fleet size:** With INT4 and typical ITL of ~30-50ms per token, each
|
|
H100 can handle roughly 20-30 QPS. For 1000 QPS total, you need
|
|
approximately 35-50 GPUs (5-7 DGX H100 nodes).
|
|
|
|
3. **Regional split:** Placing 70% of capacity in Quebec dramatically
|
|
reduces carbon (hydro grid at ~1.2 gCO2/kWh vs. US average at
|
|
~390 gCO2/kWh) while keeping 30% in US East for latency-sensitive
|
|
users. Total CO2 stays well under 500 tonnes/year.
|
|
|
|
4. **Budget check:** 5-7 DGX H100 nodes at ~$200-300K each, plus
|
|
electricity and networking, fits within the $5M annual budget.
|
|
|
|
The key insight is that **compression is not just an optimization -- it is
|
|
an architectural decision** that changes the fleet design by 2x.
|
|
|
|
</details>
|
|
|
|
### Discussion
|
|
|
|
What are the failure modes of this design? What happens when a full
|
|
DGX node goes down in the Quebec region? How does the ReliabilityModel
|
|
inform your redundancy strategy? Should you over-provision by N+1 or N+2?
|