mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-08 18:01:20 -05:00
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update the Python version requirement (3.10+) and document the editable-install
limitation (the Hatch `sources` rewrite is not supported by the `editables` backend).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Point every reference to the removed hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials at their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
284 lines · 10 KiB · Plaintext
---
title: "Starving the GPU"
subtitle: "Your GPU can process 5,300 images per second. Your CPU decodes 850."
description: "Discover that the data pipeline — not the GPU — is often the binding constraint in training. Use DataModel and TransformationModel to find the crossover where CPU preprocessing stalls the accelerator."
categories: ["data", "intermediate"]
---

## The Question

You launch ResNet-50 training on an A100 and watch `nvidia-smi`. GPU utilization reads 40%.
You expected 95%. The model is compute-bound. The hardware is top-tier. **Why is your GPU
sitting idle 60% of the time?**

The answer is almost never the model or the GPU. It is the invisible pipeline upstream:
JPEG decoding, random cropping, color jitter, and normalization — all running on the CPU.
When the CPU cannot prepare batches fast enough, the GPU starves.

::: {.callout-note}
## Prerequisites
Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd) and
[Tutorial 1: The Memory Wall](01_memory_wall.qmd). You should understand memory-bound
vs. compute-bound regimes and how `Engine.solve` reports bottlenecks.
:::

::: {.callout-note}
## What You Will Learn

- **Measure** the GPU's step time in isolation using `SingleNodeModel`
- **Calculate** the data pipeline's throughput using `DataModel` and `TransformationModel`
- **Identify** the batch size crossover where the CPU becomes the binding constraint
- **Predict** how many CPU workers are needed to eliminate the data bottleneck
:::

::: {.callout-tip}
## Background: The Three Stages of a Training Step

Every training step has three sequential stages. The slowest one determines your actual
throughput — not the GPU alone:

1. **Storage I/O** (Wall 8) — Read raw data from disk into CPU memory
2. **CPU Preprocessing** (Wall 9) — Decode, resize, augment, normalize
3. **Accelerator Compute** (Wall 1) — Forward pass, backward pass, weight update

The GPU cannot start until stages 1 and 2 finish. If either is slower than the GPU, the
accelerator utilization drops below 100%. This is the data pipeline bottleneck. A small
back-of-envelope sketch of this arithmetic follows right after this callout.
:::
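
To make the arithmetic concrete before touching the engine, here is a toy back-of-envelope
in plain Python. The per-stage times are made-up placeholders (not engine output), and it
assumes the usual prefetching setup in which stages overlap across batches, so the
steady-state step time is set by the slowest stage:

```python
# Illustrative (assumed) per-stage times for one batch; real values come from
# the solvers used later in this tutorial.
io_time_ms = 10.0          # Wall 8: read compressed samples from storage
preprocess_time_ms = 35.0  # Wall 9: JPEG decode + augment + normalize on CPU
compute_time_ms = 20.0     # Wall 1: forward + backward + update on the GPU

# With prefetching the stages overlap, so steady-state throughput is governed
# by the slowest stage; GPU utilization is compute time over that maximum.
effective_step_ms = max(io_time_ms, preprocess_time_ms, compute_time_ms)
gpu_utilization = compute_time_ms / effective_step_ms

print(f"Effective step: {effective_step_ms:.0f} ms, GPU utilization: {gpu_utilization:.0%}")
# -> Effective step: 35 ms, GPU utilization: 57%
```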

---

## 1. Setup

```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
Engine = mlsysim.Engine
```

```python
import mlsysim
from mlsysim import SingleNodeModel, DataModel, TransformationModel
```

---

## 2. GPU Compute Time: The Ceiling You Think You Have

We switch from LLM serving (Tutorials 2–3) to **CNN training** because the data pipeline
bottleneck is most visible here. LLM training on tokenized text has a tiny data footprint
(~8 MB/s as we will see in [Tutorial 12](12_full_stack_audit.qmd)). Image training with
JPEG decoding, resizing, and augmentation can demand 10–100× more CPU work per sample —
this is where the GPU actually starves.

First, establish how fast the A100 processes a ResNet-50 training step in isolation — no
data loading, no preprocessing, just pure compute:

```{python}
from mlsysim import SingleNodeModel
from mlsysim.core.constants import Q_
from mlsysim.show import table, info

model = mlsysim.Models.ResNet50
hardware = mlsysim.Hardware.Cloud.A100
solver = SingleNodeModel()

# Baseline: ResNet-50 on A100, batch 256, FP16
profile = solver.solve(model=model, hardware=hardware, batch_size=256, precision="fp16")

info("GPU Compute Baseline",
     Model=model.name,
     Hardware=hardware.name,
     Batch_size=256,
     Step_latency=profile.latency.to('ms'),
     Throughput=f"{profile.throughput:.0f} img/s",
     Bottleneck=profile.bottleneck)
```

The GPU can process this batch in tens of milliseconds. That is the ceiling. Now let's
check whether the data pipeline can keep up.
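
To put a number on that ceiling: the isolated step latency converts directly into an
images-per-second rate. A quick sketch, reusing `profile` from the chunk above (and
assuming, as elsewhere in this tutorial, that `latency` is a pint quantity):

```python
# The GPU can never exceed batch_size / step_time images per second,
# no matter how fast the data pipeline is.
batch_size = 256
step_time_s = profile.latency.to("s").magnitude
gpu_ceiling_img_s = batch_size / step_time_s
print(f"GPU throughput ceiling: {gpu_ceiling_img_s:,.0f} img/s")
```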

---

## 3. Storage I/O Check: Can the Disk Deliver?

ImageNet images average ~500 KB each (JPEG compressed). At batch 256, the GPU demands a
burst of data every step. Can the storage subsystem supply it?

```{python}
from mlsysim import DataModel

sample_size = Q_("500 KB")  # Average ImageNet JPEG
batch_size = 256

# Data demand = batch_size x sample_size / step_time
step_time_s = profile.latency.to("s").magnitude
data_per_step = (batch_size * sample_size.to("GB")).magnitude
demand_rate = Q_(data_per_step / step_time_s, "GB/s")

data_solver = DataModel()
data_result = data_solver.solve(workload_data_rate=demand_rate, hardware=hardware)

info("Storage I/O Check",
     Data_demand=f"{demand_rate:.3f}",
     Storage_supply=f"{data_result.supply_bw:.2f}",
     Utilization=f"{data_result.utilization:.1%}",
     Is_stalled=data_result.is_stalled)
```

Storage I/O is fine — modern NVMe SSDs can deliver multi-GB/s easily. The bottleneck is
not reading the bytes. It is *transforming* them.
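
As a quick sanity check on that claim, the two numbers printed above can be compared
directly. A sketch, assuming `supply_bw` comes back as a pint bandwidth (as the formatting
in the chunk suggests):

```python
# Headroom = how many times over the per-step data demand the storage can deliver.
# Anything comfortably above 1x means Wall 8 is not the binding stage here.
headroom = (data_result.supply_bw / demand_rate).to("dimensionless").magnitude
print(f"Storage headroom: {headroom:.1f}x the training data demand")
```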

---

## 4. The Reveal: CPU Preprocessing Is the Wall

Even with fast storage, the CPU must decode JPEGs, apply random crops, color jitter, and
normalization. A typical CPU worker processes ImageNet images at ~250 MB/s. With 8 workers,
total CPU throughput is ~2 GB/s:

```{python}
from mlsysim import TransformationModel

transform_solver = TransformationModel()
cpu_throughput = Q_("2 GB/s")  # 8 workers x 250 MB/s each

t = transform_solver.solve(
    batch_size=256,
    sample_size_bytes=sample_size,
    cpu_throughput=cpu_throughput,
    accelerator_step_time=profile.latency
)

info("CPU vs GPU Pipeline",
     CPU_transform_time=t.transform_time,
     GPU_step_time=t.accelerator_step_time,
     CPU_is_bottleneck=t.is_bottleneck,
     GPU_utilization=f"{t.accelerator_utilization:.1%}",
     Slowdown_factor=f"{t.slowdown_factor:.2f}x")
```

::: {.callout-important}
## Key Insight

**The binding constraint is not silicon — it is JPEG decoding on the CPU.** The data
pipeline (Wall 9: Transformation) becomes the bottleneck before the GPU (Wall 1: Compute).
Your GPU can process 5,300+ images per second, but your 8 CPU workers can only prepare
~850. The GPU sits idle waiting for data. This is why production training pipelines use
GPU-accelerated preprocessing (NVIDIA DALI), pre-decoded datasets, or aggressive
prefetching. A rough estimate of the worker count needed to keep the GPU fed follows
below.
:::
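
How much CPU throughput would actually be needed? A rough sketch, reusing the per-step
data demand from Section 3 and the assumed ~250 MB/s per worker quoted above (Exercise 2
asks a variant of the same question):

```python
import math

# To avoid starving the GPU, CPU preprocessing must sustain at least the same
# data rate the GPU consumes per step (demand_rate from Section 3).
per_worker = Q_("250 MB/s")  # assumed per-worker decode + augment rate
workers_needed = math.ceil((demand_rate / per_worker).to("dimensionless").magnitude)
print(f"Workers needed to keep the A100 fed at batch 256: ~{workers_needed}")
```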

---

## 5. Batch Size Sweep: Finding the Crossover

Let's sweep batch sizes to find exactly where the CPU becomes the binding constraint. At
small batches, the GPU is slower and data arrives in time. At large batches, the GPU
becomes more efficient but the CPU falls behind:

```{python}
rows = []
for bs in [32, 64, 128, 256, 512, 1024]:
    p = solver.solve(model=model, hardware=hardware, batch_size=bs, precision="fp16")

    t = transform_solver.solve(
        batch_size=bs,
        sample_size_bytes=sample_size,
        cpu_throughput=cpu_throughput,
        accelerator_step_time=p.latency
    )

    binding = "Transformation" if t.is_bottleneck else p.bottleneck
    rows.append([
        bs,
        f"{p.latency.to('ms').magnitude:.2f} ms",
        f"{t.transform_time.to('ms').magnitude:.2f} ms",
        binding,
        f"{t.accelerator_utilization:.1%}"
    ])

table(["Batch", "GPU Step", "CPU Xform", "Binding", "GPU Util"], rows)
```

Watch the crossover: at small batch sizes the GPU is the bottleneck (100% utilization).
As batch size grows, CPU preprocessing time grows linearly while GPU step time grows
sub-linearly. Eventually Wall 9 becomes the binding constraint and GPU utilization drops.
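
If you want the crossover point itself rather than eyeballing the table, the same two
solvers can be swept over a finer grid. A sketch, reusing the objects defined above:

```python
# Walk a finer batch-size grid and report the first point where the
# transformation stage (Wall 9) becomes the binding constraint.
crossover = None
for bs in range(32, 1025, 32):
    p = solver.solve(model=model, hardware=hardware, batch_size=bs, precision="fp16")
    t = transform_solver.solve(
        batch_size=bs,
        sample_size_bytes=sample_size,
        cpu_throughput=cpu_throughput,
        accelerator_step_time=p.latency,
    )
    if t.is_bottleneck:
        crossover = bs
        break

print(f"CPU becomes the binding constraint at batch size ~{crossover}")
```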

---

## 6. The Fix: Adding CPU Workers

The simplest fix for a CPU bottleneck is more workers. Let's compare 8 vs. 16 vs. 32:

```{python}
rows = []
for n_workers in [8, 16, 32]:
    cpu_tp = Q_(f"{n_workers * 250} MB/s")

    p = solver.solve(model=model, hardware=hardware, batch_size=512, precision="fp16")

    t = transform_solver.solve(
        batch_size=512,
        sample_size_bytes=sample_size,
        cpu_throughput=cpu_tp,
        accelerator_step_time=p.latency
    )

    rows.append([n_workers, cpu_tp.to('GB/s'), f"{t.accelerator_utilization:.1%}"])

table(["Workers", "CPU Throughput", "GPU Util @ bs=512"], rows)
```

Doubling workers doubles throughput — but you eventually hit either storage I/O limits
(Wall 8) or PCIe bandwidth. The takeaway: always check *all three stages* of the pipeline.
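
That diminishing-returns point can be modeled the same way: cap the effective preprocessing
rate at whatever the storage subsystem can actually supply. A sketch, assuming
`data_result.supply_bw` from Section 3 is a pint bandwidth:

```python
# Beyond some worker count, storage I/O (Wall 8) caps the pipeline:
# adding workers no longer raises the effective preprocessing rate.
storage_cap = data_result.supply_bw.to("GB/s")
for n_workers in [8, 16, 32, 64, 128]:
    raw_cpu = Q_(f"{n_workers * 250} MB/s").to("GB/s")
    effective = min(raw_cpu, storage_cap)
    print(f"{n_workers:3d} workers: raw {raw_cpu}, effective {effective}")
```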

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
At batch size 64 with 8 CPU workers (2 GB/s total), will ResNet-50 training on the A100
be GPU-bound or CPU-bound? Write your prediction, then run the code. What determines the
answer? (Hint: compare `transform_time` vs. `accelerator_step_time`.)

**Exercise 2: Medical imaging — larger samples.**
Medical imaging uses images 10x larger than ImageNet (~5 MB per sample). Change
`sample_size` to `Q_("5 MB")` and re-run the batch size sweep. At what batch size does
the CPU stall the GPU now? How many workers would you need to keep up at batch 256?

**Exercise 3: GPU-accelerated preprocessing.**
If you use NVIDIA DALI to move preprocessing to the GPU, the CPU bottleneck effectively
disappears. Model this by setting `cpu_throughput = Q_("50 GB/s")`. Run the sweep again.
Does the bottleneck shift back to compute? What is the new GPU utilization at batch 512?

**Self-check:** If the GPU step takes 20 ms and CPU preprocessing takes 35 ms, what is the
accelerator utilization? (Answer: 20/35 ≈ 57%.)

A starter snippet for Exercises 2 and 3 follows right after this callout.
:::
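
If you want a starting point, Exercises 2 and 3 are each a one-line change to the constants
used above; re-run the Section 5 sweep after each change:

```python
# Exercise 2: medical-imaging samples are roughly 10x larger than ImageNet JPEGs.
sample_size = Q_("5 MB")

# Exercise 3: approximate DALI-style GPU preprocessing by making the
# transformation stage effectively unlimited relative to the GPU.
cpu_throughput = Q_("50 GB/s")
```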

---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Data pipelines have three stages**: storage I/O, CPU preprocessing, and GPU compute — the slowest determines throughput
- **CPU preprocessing (Wall 9)** is the most common bottleneck: JPEG decode, augmentation, and normalization are all CPU-bound
- **Batch size shifts the binding constraint**: small batches are GPU-bound; large batches often become CPU-bound
- **Adding CPU workers** helps linearly but has diminishing returns when storage I/O becomes the limit
- **Always check all three stages** before concluding that the GPU is the bottleneck
:::

---

## Next Steps

- **[Quantization: Not a Free Lunch](05_quantization.qmd)** — When reducing precision helps (and when it doesn't)
- **[KV-Cache: The Hidden Tax](03_kv_cache.qmd)** — Another hidden memory consumer: the KV-cache in LLM serving
- **[Where to Invest](09_sensitivity.qmd)** — Use sensitivity analysis to decide whether more CPU workers or a faster GPU is the better investment
- **[Silicon Zoo](../zoo/hardware.qmd)** — Compare storage and interconnect specs across GPU platforms