---
title: "For Engineers & Researchers"
subtitle: "Back-of-envelope estimates before you provision hardware."
---

MLSYSIM gives you quick, type-safe analytical estimates for capacity planning, hardware selection, cost modeling, and sustainability analysis -- in seconds, from specifications alone. Every equation is grounded in peer-reviewed literature. Every hardware spec comes from a real datasheet.

---

## Why Use Analytical Models?

Before running expensive benchmarks or provisioning cloud instances, you need directional answers:

- **Will this model fit in GPU memory?** -- Check before renting the GPU
- **What's the expected TTFT for my LLM?** -- Estimate before building the serving stack
- **How many H100s do I actually need?** -- Model scaling efficiency before buying the cluster
- **What will this cost per year?** -- TCO analysis before signing the contract
- **How often will my training job crash?** -- Reliability modeling before committing to a 30-day run
- **What's the carbon footprint of this deployment?** -- Quantify before the sustainability review

MLSYSIM answers these in microseconds using first-order equations. It won't replace profiling, but it tells you *where to start profiling*.

::: {.callout-tip}
## Theory Behind the Tools
Each solver implements equations from the [Math Foundations](math.qmd) page. For the full conceptual framework, see the companion slide decks linked in the [Solver-to-Slides Map](#solver-to-slides-map) below.
:::

---

## Quick Start: Roofline Analysis

The `Engine` implements the [Roofline Performance Model](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"} (Williams et al. 2009) to classify workloads as compute-bound or memory-bound.

```python
import mlsysim
from mlsysim import Engine

# Single-node: Is ResNet-50 memory-bound on A100?
profile = Engine.solve(
    model=mlsysim.Models.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1, precision="fp16"
)
print(f"{profile.bottleneck}, {profile.latency.to('ms'):~.2f}")
print(f"MFU: {profile.mfu:.1%}, Arithmetic Intensity: {profile.arithmetic_intensity:~.2f}")
```

The returned `PerformanceProfile` gives you latency, throughput, bottleneck classification, Model FLOPs Utilization (MFU), arithmetic intensity, energy, and a feasibility flag -- everything you need for a first-pass hardware assessment.
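
To make the bottleneck classification concrete, here is a minimal standalone sketch of the textbook roofline bound in plain Python. It is not MLSYSIM's implementation. The `roofline` helper and the workload figures (`flops`, `bytes_moved`) are illustrative assumptions; only the A100 peak compute and bandwidth are taken from the Type Safety example on this page.

```python
def roofline(peak_flops: float, mem_bw: float, flops: float, bytes_moved: float) -> dict:
    """First-order roofline estimate (hypothetical helper, not part of mlsysim)."""
    intensity = flops / bytes_moved        # arithmetic intensity, FLOP per byte
    ridge = peak_flops / mem_bw            # intensity at which the two roofs meet
    attainable = min(peak_flops, intensity * mem_bw)
    return {
        "arithmetic_intensity": intensity,
        "ridge_point": ridge,
        "attainable_flops": attainable,
        "bottleneck": "compute" if intensity >= ridge else "memory",
        "latency_s": flops / attainable,
    }

# A100 figures as quoted on this page; the workload numbers are made up for illustration.
a100_peak = 312e12   # FLOP/s
a100_bw = 2.0e12     # bytes/s
est = roofline(a100_peak, a100_bw, flops=8.0e9, bytes_moved=1.0e8)
print(est["bottleneck"], f"{est['latency_s'] * 1e3:.3f} ms")
```

A workload whose intensity falls below the ridge point is limited by memory bandwidth, so raising the batch size (more FLOPs per byte moved) is usually the first lever to try.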

---

## LLM Serving Analysis

The `ServingModel` models the [two-phase LLM inference lifecycle](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf){target="_blank"}: compute-bound pre-fill and memory-bound decoding.

```python
import mlsysim
from mlsysim import ServingModel

serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)
print(f"TTFT:     {result.ttft.to('ms'):~.1f}")
print(f"ITL:      {result.itl.to('ms'):~.2f}")
print(f"KV-cache: {result.kv_cache_size.to('GB'):~.1f}")
print(f"Feasible: {result.feasible}")
print(f"Mem util: {result.memory_utilization:.0%}")
```

The feasibility check tells you immediately whether the model plus its KV-cache fit in device memory -- before you discover the OOM at 3 AM in production.
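
The arithmetic behind such a check can be sketched by hand. The snippet below is an independent illustration, not MLSYSIM code. The model shape (80 layers, 8 KV heads, head dimension 128, 70B parameters) and the 80 GB device capacity are assumptions based on commonly published figures, and both helper functions are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV-cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

def memory_fit(n_params, bytes_per_param, kv_bytes, device_mem_bytes):
    """Return (fits, utilization) for weights plus KV-cache on a single device."""
    total = n_params * bytes_per_param + kv_bytes
    return total <= device_mem_bytes, total / device_mem_bytes

# Assumed shape for a 70B-parameter model with grouped-query attention, fp16 everywhere.
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)
feasible, util = memory_fit(n_params=70e9, bytes_per_param=2,
                            kv_bytes=kv, device_mem_bytes=80e9)
print(f"KV-cache: {kv / 1e9:.2f} GB, feasible: {feasible}, utilization: {util:.0%}")
```

Under these assumed figures the fp16 weights alone exceed one 80 GB device, so the estimate points toward tensor parallelism or quantization. The numbers MLSYSIM reports depend on its own model and hardware specs.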

---

## Hardware Sweep Pattern

Compare devices programmatically instead of reading datasheets:

```python
import mlsysim
from mlsysim import Engine

model = mlsysim.Models.ResNet50

for hw in [mlsysim.Hardware.Cloud.H100,
           mlsysim.Hardware.Cloud.A100,
           mlsysim.Hardware.Cloud.T4,
           mlsysim.Hardware.Edge.JetsonOrinNX]:
    p = Engine.solve(model=model, hardware=hw, batch_size=32, precision="fp16")
    print(f"{hw.name:20s} {p.bottleneck:16s} {p.latency.to('ms'):>8.2f~} {p.throughput:>8.0f} img/s")
```

---

## Distributed Training Analysis

The `DistributedModel` models [3D parallelism](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} (data, tensor, pipeline) with communication overhead from [ring all-reduce](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf){target="_blank"} and pipeline bubbles.

```python
import mlsysim
from mlsysim import DistributedModel

dist = DistributedModel()
result = dist.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    fleet=mlsysim.Systems.Clusters.Research_256,
    batch_size=512, precision="fp16",
    tp_size=8, pp_size=4, microbatch_count=16
)
print(f"Scaling efficiency: {result.scaling_efficiency:.1%}")
print(f"DP all-reduce:      {result.dp_communication_latency.to('ms'):~.1f}")
print(f"TP overhead:        {result.tp_communication_latency.to('ms'):~.1f}")
print(f"Pipeline bubble:    {result.bubble_fraction:.1%}")
print(f"Step latency:       {result.step_latency_total.to('ms'):~.1f}")
```

Tune `tp_size`, `pp_size`, and `microbatch_count` to find the parallelism configuration that maximizes scaling efficiency for your cluster topology.
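
To see why these knobs matter, the two standard first-order terms can be computed by hand. This is a sketch of the textbook formulas, not of MLSYSIM internals. It assumes the cluster has 256 accelerators (as the name `Research_256` suggests), a hypothetical 50 GB/s per-link bandwidth, and fp16 gradients sharded evenly across the tensor and pipeline ranks.

```python
def bubble_fraction(pp_size: int, microbatch_count: int) -> float:
    """Idle fraction of a synchronous pipeline schedule: (p - 1) / (m + p - 1)."""
    return (pp_size - 1) / (microbatch_count + pp_size - 1)

def ring_allreduce_seconds(message_bytes: float, n_ranks: int, link_bw: float) -> float:
    """Bandwidth term of ring all-reduce: each rank moves 2 * (N - 1) / N of the message."""
    if n_ranks <= 1:
        return 0.0
    return 2 * (n_ranks - 1) / n_ranks * message_bytes / link_bw

tp_size, pp_size, microbatch_count = 8, 4, 16
n_gpus = 256                                  # assumption: inferred from the cluster name
dp_size = n_gpus // (tp_size * pp_size)       # data-parallel replicas

bubble = bubble_fraction(pp_size, microbatch_count)
grad_bytes = 70e9 * 2 / (tp_size * pp_size)   # fp16 gradient shard held by each rank
allreduce_s = ring_allreduce_seconds(grad_bytes, dp_size, link_bw=50e9)  # 50 GB/s is hypothetical

print(f"dp_size={dp_size}, bubble={bubble:.1%}, DP all-reduce={allreduce_s * 1e3:.0f} ms")
```

Doubling `microbatch_count` shrinks the bubble term, while raising `tp_size` or `pp_size` shrinks the gradient shard each data-parallel rank must reduce; the sketch ignores latency terms and any overlap of communication with compute.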

---

## Composing Solvers for Real Questions

The core solvers are designed to chain. Here are three common engineering workflows.

### "Can I serve Llama-70B on H100s within budget?"

```python
import mlsysim
from mlsysim import ServingModel, EconomicsModel

# Step 1: Does it fit and what's the latency?
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)
print(f"Feasible: {result.feasible}, TTFT: {result.ttft.to('ms'):~.1f}")

# Step 2: What does that fleet cost?
econ = EconomicsModel()
cost = econ.solve(
    fleet=mlsysim.Systems.Clusters.Research_256,
    duration_days=365,
    kwh_price=0.08
)
print(f"Annual TCO: ${cost.tco_usd:,.0f}")
print(f"  CapEx:    ${cost.capex_usd:,.0f}")
print(f"  OpEx:     ${cost.total_opex_usd:,.0f}")
```

### "Where should I train to minimize carbon?"

```python
import mlsysim
from mlsysim import SustainabilityModel

sustain = SustainabilityModel()
for grid in [mlsysim.Infra.Grids.Quebec, mlsysim.Infra.Grids.US_Avg,
             mlsysim.Infra.Grids.Poland]:
    r = sustain.solve(
        fleet=mlsysim.Systems.Clusters.Research_256,
        duration_days=30,
        datacenter=grid
    )
    carbon_tons = r.carbon_footprint_kg / 1000.0
    print(f"{grid.name:12s} {carbon_tons:8.1f} tCO2e  "
          f"PUE={r.pue:.2f}  Water={r.water_usage_liters:,.0f} L")
```

For the theory behind PUE, carbon intensity, and the energy hierarchy, see the [Sustainable AI](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"} slide deck.
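
As a rough guide to how PUE and carbon intensity combine, the usual first-order relation is facility energy = IT energy × PUE, and emissions = facility energy × grid carbon intensity. The sketch below applies it in plain Python. Every number in it is a placeholder chosen for illustration (per-accelerator power, PUE, and both grid intensities), not data from MLSYSIM or from any real grid.

```python
def carbon_kg(it_power_kw: float, duration_days: float, pue: float,
              grid_gco2_per_kwh: float) -> float:
    """Operational emissions in kg CO2e: IT energy, scaled by PUE, times grid intensity."""
    it_energy_kwh = it_power_kw * 24 * duration_days
    facility_energy_kwh = it_energy_kwh * pue
    return facility_energy_kwh * grid_gco2_per_kwh / 1000.0   # grams -> kilograms

it_power_kw = 256 * 0.7   # placeholder: 256 accelerators at 0.7 kW each
clean = carbon_kg(it_power_kw, duration_days=30, pue=1.2, grid_gco2_per_kwh=30)   # placeholder low-carbon grid
dirty = carbon_kg(it_power_kw, duration_days=30, pue=1.2, grid_gco2_per_kwh=700)  # placeholder high-carbon grid
print(f"low-carbon grid:  {clean / 1000:.1f} tCO2e")
print(f"high-carbon grid: {dirty / 1000:.1f} tCO2e")
```

Because the relation is linear in grid intensity, the choice of region can dominate the result even when the hardware and PUE are identical.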

### "How reliable is a 30-day training run on 256 GPUs?"

```python
import mlsysim
from mlsysim import ReliabilityModel

rel = ReliabilityModel()
result = rel.solve(
    fleet=mlsysim.Systems.Clusters.Research_256,
    job_duration_hours=720,       # 30 days
    checkpoint_time_s=120.0       # 2 minutes per checkpoint
)
print(f"Fleet MTBF:              {result.fleet_mtbf.to('hour'):~.1f}")
print(f"P(failure before done):  {result.failure_probability:.1%}")
print(f"Optimal ckpt interval:   {result.optimal_checkpoint_interval.to('minute'):~.1f}")
print(f"Expected failures:       {result.expected_failures:.1f}")
```

This solver implements the [Young-Daly checkpoint model](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf){target="_blank"} -- essential for capacity planning on long training jobs.
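
For intuition, the Young-Daly first-order result puts the optimal checkpoint interval at roughly sqrt(2 × checkpoint time × MTBF). The sketch below works through it, assuming independent, exponentially distributed failures. The 50,000-hour per-GPU MTBF is a hypothetical placeholder and the helper names are illustrative; none of it is taken from MLSYSIM.

```python
import math

def fleet_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """With independent failures, fleet MTBF shrinks linearly with node count."""
    return node_mtbf_hours / n_nodes

def young_daly_interval_s(checkpoint_time_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)

def failure_probability(job_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure during the job (exponential model)."""
    return 1.0 - math.exp(-job_hours / mtbf_hours)

mtbf_h = fleet_mtbf_hours(node_mtbf_hours=50_000, n_nodes=256)   # placeholder per-GPU MTBF
interval_s = young_daly_interval_s(checkpoint_time_s=120.0, mtbf_s=mtbf_h * 3600)
expected_failures = 720 / mtbf_h                                  # 30-day job
p_fail = failure_probability(720, mtbf_h)

print(f"Fleet MTBF: {mtbf_h:.1f} h, checkpoint every {interval_s / 60:.0f} min")
print(f"Expected failures: {expected_failures:.1f}, P(at least one): {p_fail:.1%}")
```

The square-root form means a checkpoint that is four times slower only doubles the optimal interval, which is why faster checkpoint storage pays off less than intuition suggests.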

---

## Writing Custom Solvers

Follow the built-in solver pattern to create your own analysis:

```python
from mlsysim.hardware.types import HardwareNode

class PowerEfficiencyModel:
    def solve(self, hardware: HardwareNode) -> dict:
        flops_per_watt = hardware.compute.peak_flops / hardware.tdp
        return {
            "device": hardware.name,
            "flops_per_watt": flops_per_watt.to("TFLOPs/s/kW"),
        }
```

See [Writing a Custom Solver](solver-guide.qmd#writing-a-custom-solver) for the full guide.

---

## Type Safety

All quantities are `pint.Quantity` objects. Unit conversions are explicit, and dimensional errors are caught at runtime:

```python
import mlsysim

hw = mlsysim.Hardware.Cloud.A100
hw.compute.peak_flops.to("TFLOPs/s")   # → 312.0 TFLOPs/s
hw.memory.bandwidth.to("TB/s")         # → 2.0 TB/s
hw.memory.bandwidth.to("FLOP/s")       # → DimensionalityError ✓
```

This means you can chain computations across solvers without worrying about unit mismatches -- `pint` catches them for you.

---

## Solver-to-Slides Map {#solver-to-slides-map}

Each MLSYSIM solver maps to specific chapters and slide decks from the [Machine Learning Systems](https://mlsysbook.ai) textbook. Use these for the full theoretical grounding behind each solver.

| MLSYSIM Solver | What It Models | Slide Deck |
|:---------------|:---------------|:-----------|
| Engine / SingleNodeModel | Roofline analysis, compute vs. memory bottleneck | [Hardware Acceleration (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"} |
| ServingModel | TTFT, ITL, KV-cache memory | [Model Serving (Vol I, Ch 13)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf){target="_blank"} and [Inference at Scale (Vol II, Ch 9)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_09_inference.pdf){target="_blank"} |
| DistributedModel | 3D parallelism, all-reduce, pipeline bubbles | [Distributed Training (Vol II, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} and [Collective Communication (Vol II, Ch 6)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_06_collective_communication.pdf){target="_blank"} |
| EconomicsModel | CapEx, OpEx, total cost of ownership | [Compute Infrastructure (Vol II, Ch 2)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf){target="_blank"} |
| SustainabilityModel | Energy, carbon footprint, water usage | [Sustainable AI (Vol II, Ch 15)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"} |
| ReliabilityModel | MTBF, Young-Daly checkpointing | [Fault Tolerance (Vol II, Ch 7)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_07_fault_tolerance.pdf){target="_blank"} |

: {tbl-colwidths="[20,30,50]"}

:::: {.columns}

::: {.column width="50%"}
**Volume I: Foundations** (17 decks, 570 slides)

[Browse Vol I Decks](https://mlsysbook.ai/slides/vol1.html){target="_blank"} | [Download All (PDF)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/MLSysBook-Slides-Vol1-PDF.zip){target="_blank"}
:::

::: {.column width="50%"}
**Volume II: At Scale** (18 decks, 529 slides)

[Browse Vol II Decks](https://mlsysbook.ai/slides/vol2.html){target="_blank"} | [Download All (PDF)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/MLSysBook-Slides-Vol2-PDF.zip){target="_blank"}
:::

::::

---

## Next Steps

- **[Getting Started](getting-started.qmd)** -- Install and run your first analysis
- **[Solver Guide](solver-guide.qmd)** -- Which solver for which question
- **[MLSys Zoo](zoo/index.qmd)** -- Browse all available hardware, model, and infrastructure specs
- **[API Reference](api/index.qmd)** -- Full programmatic API documentation
- **[Accuracy & Validation](accuracy.qmd)** -- How analytical bounds compare to empirical measurements
- **[Math Foundations](math.qmd)** -- The equations behind every solver
- **[All Slide Decks](https://mlsysbook.ai/slides/)** -- 35 Beamer decks with speaker notes and active learning exercises