---
title: "Getting Started"
subtitle: "Install MLSYSIM and run your first analysis in under 5 minutes."
---
::: {.callout-note}
## Prerequisites
MLSYSIM assumes basic Python familiarity (variables, functions, `pip install`). No prior ML or hardware knowledge is required. Key concepts like **roofline analysis**, **memory-bound vs. compute-bound**, and **FLOP/s** are explained in context throughout the tutorials. For a full reference of terms, see the [Glossary](glossary.qmd).
:::
## Installation
MLSYSIM requires Python 3.10+ and installs cleanly with pip:
```bash
pip install mlsysim
```
For development or to follow along with tutorials locally:
```bash
git clone https://github.com/harvard-edge/cs249r_book
cd cs249r_book/mlsysim
pip install -e ".[dev]"
```
Verify the installation:
```bash
python -c "import mlsysim; print(mlsysim.__version__)"
```
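If the import fails, a quick environment check can narrow down the cause. The snippet below is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper written for this guide, not part of MLSYSIM.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this guide


def check_environment(package: str = "mlsysim") -> dict:
    """Report whether the interpreter and package look ready (hypothetical helper)."""
    return {
        # Tuple comparison: (3, 11, ...) >= (3, 10) is True
        "python_ok": sys.version_info >= MIN_PYTHON,
        # find_spec returns None when a top-level package cannot be located
        "package_found": importlib.util.find_spec(package) is not None,
    }


status = check_environment()
print(f"Python {sys.version_info.major}.{sys.version_info.minor} supported: {status['python_ok']}")
print(f"mlsysim importable: {status['package_found']}")
```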
::: {.callout-tip}
## Local install recommended for now
Tutorials are pure Python and run in any Python 3.10+ environment. Hosted **Google Colab** and **Binder** launch buttons are planned for a future release; until then, install locally with the steps above.
:::
---
## Your First Analysis
Once installed, you can run a complete roofline analysis in a few lines of Python. The roofline model is the foundation of ML systems performance reasoning -- it determines whether your workload is limited by compute (arithmetic units) or memory (data movement). For a visual walkthrough, see the [Hardware Acceleration slide deck (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"}.
```python
import mlsysim
from mlsysim import Engine
# 1. Load a model and hardware from the vetted Zoo
model = mlsysim.Models.ResNet50
hardware = mlsysim.Hardware.Cloud.A100
# 2. Solve -- the Engine applies the roofline model
profile = Engine.solve(model=model, hardware=hardware, batch_size=1, precision="fp16")
# 3. Read the results
print(f"Bottleneck: {profile.bottleneck}") # → 'Memory'
print(f"Latency: {profile.latency.to('ms'):~.2f}") # → 0.54 ms
print(f"Throughput: {profile.throughput:.0f}") # → 1843 / second
```
::: {.callout-note}
## Working with units
MLSYSIM uses the [Pint](https://pint.readthedocs.io/) library for physical units. All quantities carry attached units (ms, GB, TFLOP/s, etc.). Use `.to('ms')` to convert between units. Use `.magnitude` to extract the raw number when you need it for calculations or plotting.
:::
---
## Understanding the Output
`Engine.solve()` returns a `PerformanceProfile` -- a structured result containing everything the roofline model can tell you about your workload.
### Core fields
| Field | What it means |
|:------|:--------------|
| `bottleneck` | `'Memory'` or `'Compute'` -- which resource limits performance |
| `latency` | Time to process one batch, derived from the roofline ceiling |
| `throughput` | Samples per second = `batch_size / latency` |
| `latency_compute` | Time if only compute were the constraint |
| `latency_memory` | Time if only memory bandwidth were the constraint |
| `arithmetic_intensity` | Operations per byte -- the x-axis of the roofline plot |
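
To see how the core fields relate to one another, the textbook roofline calculation can be sketched in plain Python. This is an illustration under stated assumptions, not MLSYSIM's implementation: `RooflineEstimate` and `roofline` are hypothetical names, and the workload and device numbers are round illustrative values rather than Zoo entries.

```python
from dataclasses import dataclass


@dataclass
class RooflineEstimate:
    latency_compute: float       # seconds if only compute were the constraint
    latency_memory: float        # seconds if only memory bandwidth were the constraint
    latency: float               # roofline ceiling: the slower of the two
    throughput: float            # samples per second
    arithmetic_intensity: float  # FLOPs per byte moved
    bottleneck: str              # 'Memory' or 'Compute'


def roofline(flops, bytes_moved, peak_flops, mem_bandwidth, batch_size=1):
    """Basic roofline estimate for one batch (hypothetical helper)."""
    latency_compute = flops / peak_flops
    latency_memory = bytes_moved / mem_bandwidth
    latency = max(latency_compute, latency_memory)
    return RooflineEstimate(
        latency_compute=latency_compute,
        latency_memory=latency_memory,
        latency=latency,
        throughput=batch_size / latency,
        arithmetic_intensity=flops / bytes_moved,
        bottleneck="Memory" if latency_memory > latency_compute else "Compute",
    )


# Illustrative values only: 8 GFLOPs of work, 100 MB of data movement,
# on a device with 300 TFLOP/s peak compute and 2 TB/s memory bandwidth.
estimate = roofline(flops=8e9, bytes_moved=100e6, peak_flops=300e12, mem_bandwidth=2e12)
print(f"Bottleneck: {estimate.bottleneck}")
print(f"Latency: {estimate.latency * 1e3:.3f} ms")
print(f"Arithmetic intensity: {estimate.arithmetic_intensity:.0f} FLOP/byte")
```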
### Extended fields
| Field | What it means |
|:------|:--------------|
| `energy` | Estimated energy consumption (Joules) |
| `memory_footprint` | Total memory required for the workload |
| `mfu` | Model FLOPs Utilization -- fraction of peak compute achieved |
| `feasible` | Whether the workload fits in device memory |
::: {.callout-tip}
## The key insight
If `latency_memory > latency_compute`, you are **memory-bound**: faster arithmetic units will not help.
Instead, increase the batch size, use a more compute-dense operation (e.g., fused attention), or reduce
data movement. If `latency_compute > latency_memory`, you are **compute-bound**, and that is when parallelism and quantization pay off.
This is the same insight taught in the [Neural Network Computation slides (Vol I, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf){target="_blank"} and the [Performance Engineering slides (Vol II, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf){target="_blank"}.
:::
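
The effect of batch size can be sketched with a small sweep. The example assumes a simplified traffic model in which weights are read once per batch and activations once per sample, so larger batches reuse the same weight bytes for more arithmetic. The function name and all numbers are hypothetical and chosen only to make the crossover visible.

```python
def bottleneck(batch_size, flops_per_sample, weight_bytes,
               act_bytes_per_sample, peak_flops, mem_bandwidth):
    """Classify a batch as 'Memory'- or 'Compute'-bound under a simple roofline."""
    flops = batch_size * flops_per_sample
    # Assumption: weights are read once per batch, activations once per sample.
    bytes_moved = weight_bytes + batch_size * act_bytes_per_sample
    latency_compute = flops / peak_flops
    latency_memory = bytes_moved / mem_bandwidth
    return "Memory" if latency_memory > latency_compute else "Compute"


# Illustrative workload and device values (not taken from the Zoo)
WORKLOAD = dict(flops_per_sample=8e9, weight_bytes=200e6, act_bytes_per_sample=10e6)
DEVICE = dict(peak_flops=300e12, mem_bandwidth=2e12)

for batch_size in (1, 2, 4, 8, 16):
    print(f"batch_size={batch_size:>2}: {bottleneck(batch_size, **WORKLOAD, **DEVICE)}")
```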
---
## Exploring the Zoo
MLSYSIM ships with vetted registries of hardware, models, infrastructure, and systems -- all sourced from real datasheets. Use tab-completion to explore.
### Hardware
Five tiers spanning the full deployment spectrum:
```python
# Cloud accelerators
mlsysim.Hardware.Cloud.A100
mlsysim.Hardware.Cloud.H100
mlsysim.Hardware.Cloud.H200
# Workstation / desktop GPUs
mlsysim.Hardware.Workstation.DGX_Spark
# Mobile processors
mlsysim.Hardware.Mobile.iPhone15Pro
mlsysim.Hardware.Mobile.Snapdragon8Gen3
# Edge devices
mlsysim.Hardware.Edge.JetsonOrinNX
# Tiny / microcontroller targets
mlsysim.Hardware.Tiny.ESP32
mlsysim.Hardware.Tiny.HimaxWE1
```
For the theory behind this hardware spectrum, see the [Compute Infrastructure slides (Vol II, Ch 2)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf){target="_blank"}.
### Models
Organized by application domain:
```python
# Language models
mlsysim.Models.Language.GPT2
mlsysim.Models.Language.Llama3_8B
mlsysim.Models.Language.Llama3_70B
# Vision models
mlsysim.Models.Vision.ResNet50
mlsysim.Models.Vision.MobileNetV2
mlsysim.Models.Vision.AlexNet
# Tiny / edge models
mlsysim.Models.Tiny.DS_CNN
mlsysim.Models.Tiny.WakeVision
```
### Infrastructure
Regional grids and datacenter configurations for sustainability analysis:
```python
# Regional power grids -- carbon intensity varies by energy source
mlsysim.Infra.Grids.Quebec # hydro: ~20 gCO2/kWh
mlsysim.Infra.Grids.US_Avg # mixed: ~390 gCO2/kWh
mlsysim.Infra.Grids.Poland # coal: ~820 gCO2/kWh
```
The [Sustainable AI slides (Vol II, Ch 15)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"} explain why datacenter location is a first-class engineering decision.
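
As a rough illustration of why location matters, emissions scale linearly with grid carbon intensity. The sketch below reuses the approximate intensities from the comments in the listing above; `emissions_kg` and the 1,000 kWh workload are hypothetical and do not use the MLSYSIM API.

```python
# Approximate carbon intensities (gCO2/kWh) from the grid listing above
GRID_INTENSITY_G_PER_KWH = {"Quebec": 20, "US_Avg": 390, "Poland": 820}


def emissions_kg(energy_kwh, intensity_g_per_kwh):
    """Convert energy use to kilograms of CO2 for a given grid intensity."""
    return energy_kwh * intensity_g_per_kwh / 1000.0


energy_kwh = 1_000  # hypothetical workload energy
for region, intensity in GRID_INTENSITY_G_PER_KWH.items():
    print(f"{region}: {emissions_kg(energy_kwh, intensity):.0f} kg CO2")
```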
### Systems
Cluster definitions for distributed analysis:
```python
# Network fabrics
mlsysim.Systems.Fabrics.InfiniBand_NDR
mlsysim.Systems.Fabrics.Ethernet_100G
# Pre-configured clusters
mlsysim.Systems.Clusters.Frontier_8K
mlsysim.Systems.Clusters.Research_256
```
For the full topology and cluster modeling, see the [Distributed Training slides (Vol II, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} and [Network Fabrics slides (Vol II, Ch 3)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_03_network_fabrics.pdf){target="_blank"}.
Complete registry listings are available in the [Zoo reference pages](zoo/index.qmd).
---
## Adjusting the Efficiency Parameter
The `efficiency` parameter (η) is the single most important tuning knob in
MLSYSIM. It represents the fraction of theoretical peak hardware performance
that a workload actually achieves in practice. Unoptimized workloads often reach only 2--5% of peak; well-tuned workloads reach 35--55%.
```python
# Reuses `Engine`, `model`, and `hardware` from the first analysis above
# Default: well-optimized training (η = 0.5)
profile_default = Engine.solve(
    model=model, hardware=hardware,
    batch_size=32, precision="fp16", efficiency=0.5
)
# Conservative: typical inference workload (η = 0.35)
profile_inference = Engine.solve(
    model=model, hardware=hardware,
    batch_size=32, precision="fp16", efficiency=0.35
)
print(f"Training estimate: {profile_default.latency}")
print(f"Inference estimate: {profile_inference.latency}")
```
Typical efficiency ranges:
| Scenario | η range | Notes |
|:---------|:--------|:------|
| Well-optimized training (fp16) | 0.35--0.55 | Megatron-LM, DeepSpeed |
| Inference (fp16) | 0.25--0.45 | vLLM, TensorRT-LLM |
| Inference (int8) | 0.20--0.40 | Quantized serving |
See the [Accuracy & Validation](accuracy.qmd) page for guidance on choosing η
for different scenarios. The gap between theoretical peak and achieved throughput is covered in detail in the [Performance Engineering slides (Vol II, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf){target="_blank"}.
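
One way to build intuition for η is to treat it as a derating factor on peak compute. The sketch below assumes η scales only the compute ceiling; how MLSYSIM applies it internally may differ, and `compute_latency` and the numbers are hypothetical.

```python
def compute_latency(flops, peak_flops, efficiency):
    """Compute-limited time when only a fraction of peak is achieved."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return flops / (efficiency * peak_flops)


flops = 8e9          # illustrative work per batch
peak_flops = 300e12  # illustrative device peak

optimistic = compute_latency(flops, peak_flops, efficiency=0.5)
conservative = compute_latency(flops, peak_flops, efficiency=0.35)

# Lower efficiency means proportionally longer compute time (0.5 / 0.35 ≈ 1.43x)
print(f"Slowdown from η=0.5 to η=0.35: {conservative / optimistic:.2f}x")
```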
---
## Defining Custom Models
You are not limited to the Zoo. Define any model by specifying its parameters
and FLOPs:
```python
from mlsysim import TransformerWorkload
from mlsysim import ureg
my_model = TransformerWorkload(
    name="My-Custom-LLM",
    architecture="Transformer",
    parameters=13e9 * ureg.param,
    layers=40,
    hidden_dim=5120,
    heads=40,
    kv_heads=8,
    inference_flops=2 * 13e9 * ureg.flop,  # Rule of thumb: ~2 FLOPs per parameter per token
)
# Reuses `Engine` and `hardware` from the first analysis above
profile = Engine.solve(model=my_model, hardware=hardware, batch_size=1)
print(f"Bottleneck: {profile.bottleneck}")
print(f"Latency: {profile.latency}")
print(f"Feasible: {profile.feasible}") # Does the model fit in device memory?
```
The [Model Compression slides (Vol I, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf){target="_blank"} explain why parameter count and precision together determine both the memory footprint and the arithmetic intensity of a workload.
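
The relationship between parameter count, precision, and footprint can be checked with back-of-envelope arithmetic. This sketch counts weights only (it ignores activations and the KV cache), the 40 GB device capacity is a hypothetical value, and the helper names are not part of MLSYSIM.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_footprint_gb(num_params, precision):
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def decode_intensity(precision):
    """FLOPs per weight byte at batch size 1, using ~2 FLOPs per parameter per token."""
    return 2 / BYTES_PER_PARAM[precision]


num_params = 13e9      # same size as the custom model above
device_memory_gb = 40  # hypothetical device capacity

for precision in ("fp32", "fp16", "int8"):
    footprint = weight_footprint_gb(num_params, precision)
    fits = footprint <= device_memory_gb
    print(f"{precision}: {footprint:.0f} GB, fits={fits}, "
          f"intensity={decode_intensity(precision):.1f} FLOP/byte")
```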
---
## Companion Slide Decks
MLSYSIM is the hands-on companion to the [Machine Learning Systems](https://mlsysbook.ai) textbook. The concepts you model with MLSYSIM are taught visually in 35 Beamer slide decks (1,099 slides total) with speaker notes and active learning exercises.
| Concept in MLSYSIM | Slide Deck | Key Topics |
|:--------------------|:-----------|:-----------|
| `Engine.solve()` and the roofline model | [Hardware Acceleration (Vol I, Ch 11)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_11_hw_acceleration.pdf){target="_blank"} | Roofline model, arithmetic intensity, systolic arrays, memory wall |
| FLOPs, MACs, and compute cost | [Neural Network Computation (Vol I, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_05_nn_computation.pdf){target="_blank"} | Forward/backward pass cost, training memory breakdown |
| Training memory and mixed precision | [Model Training (Vol I, Ch 8)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_08_training.pdf){target="_blank"} | Iron Law of Training, gradient checkpointing, mixed precision |
| Quantization and compression | [Model Compression (Vol I, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_10_model_compression.pdf){target="_blank"} | Pruning, quantization, knowledge distillation |
| Hardware Zoo tiers | [Compute Infrastructure (Vol II, Ch 2)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_02_compute_infrastructure.pdf){target="_blank"} | Accelerator spectrum, HBM architecture, TCO |
| `DistributedModel` | [Distributed Training (Vol II, Ch 5)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_05_distributed_training.pdf){target="_blank"} | 3D parallelism, scaling efficiency, communication overhead |
| `ServingModel` and LLM inference | [Model Serving (Vol I, Ch 13)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_13_model_serving.pdf){target="_blank"} | TTFT, ITL, KV-cache, batching strategies |
| `SustainabilityModel` | [Sustainable AI (Vol II, Ch 15)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_15_sustainable_ai.pdf){target="_blank"} | Energy wall, carbon geography, PUE |
| Efficiency parameter (η) | [Performance Engineering (Vol II, Ch 10)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol2_10_performance_engineering.pdf){target="_blank"} | Operator fusion, FlashAttention, precision engineering |
| Benchmarking and validation | [Benchmarking (Vol I, Ch 12)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/vol1_12_benchmarking.pdf){target="_blank"} | MLPerf, measurement methodology, latency percentiles |
: {tbl-colwidths="[22,30,48]"}
:::: {.columns}
::: {.column width="50%"}
**[Volume I: Foundations](https://mlsysbook.ai/slides/vol1.html){target="_blank"}** -- 17 decks, 570 slides
[Download All PDFs (ZIP)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/MLSysBook-Slides-Vol1-PDF.zip){target="_blank"}
:::
::: {.column width="50%"}
**[Volume II: At Scale](https://mlsysbook.ai/slides/vol2.html){target="_blank"}** -- 18 decks, 529 slides
[Download All PDFs (ZIP)](https://github.com/harvard-edge/cs249r_book/releases/download/slides-latest/MLSysBook-Slides-Vol2-PDF.zip){target="_blank"}
:::
::::
---
## Next Steps
::: {.callout-tip}
## Recommended path
Follow the [structured learning path](tutorials/index.qmd) on the Tutorials page,
starting with the **[Hello, Roofline Tutorial](tutorials/00_hello_roofline.qmd)**. Each tutorial
pairs with a companion slide deck for visual explanations and active learning exercises.
For a complete reference of which solver to use for different questions, see the
**[Solver Guide](solver-guide.qmd)**.
:::