mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-06 17:49:07 -05:00
---
title: "CLI Reference"
subtitle: "Every command, every flag, with real examples."
---

MLSys·im ships an agent-ready CLI built on [Typer](https://typer.tiangolo.com/) and [Rich](https://rich.readthedocs.io/). It follows the [3-Tier Command Mapping](architecture.qmd): `eval` maps to Models, `optimize` maps to Optimizers, and `zoo` maps to the registries.

::: {.callout-tip}
## Output Formats
Every command supports `-o json` for machine-parseable output and `-o markdown` for reports. The default is `text` (human-readable Rich tables). AI agents should always use `-o json`.
:::

---

## Quick Examples

```bash
# What's in the Zoo?
mlsysim zoo hardware
mlsysim zoo models

# Single-node roofline: is Llama-3 8B memory-bound on H100?
mlsysim eval Llama3_8B H100

# Same thing, but with batch size 32 and fp8 precision
mlsysim eval Llama3_8B H100 --batch-size 32 --precision fp8

# Full cluster evaluation from a YAML spec
mlsysim eval cluster.yaml

# Machine-readable JSON for CI/CD pipelines
mlsysim eval Llama3_8B H100 -o json

# Export JSON Schema for IDE autocompletion
mlsysim schema --type hardware > hardware.schema.json
```

---

## Exit Codes

The CLI uses semantic exit codes so scripts and CI pipelines can react programmatically:

| Code | Meaning | Example |
|:-----|:--------|:--------|
| `0` | Success | Analysis completed, all assertions passed |
| `1` | Bad input | Unknown model name, malformed YAML, missing required flag |
| `2` | Physics violation | OOM — model does not fit in memory at the given precision |
| `3` | SLA violation | A `constraints.assert` check in the YAML failed |

```bash
mlsysim eval Llama3_70B T4 --batch-size 1
# Exit code 2: OOM — 140 GB model weights exceed 16 GB T4 memory
echo $?  # → 2
```
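
A wrapper script can branch on these codes instead of parsing output. The sketch below is one way to do that; `describe_exit_code` and `run_and_report` are hypothetical helper names, not part of mlsysim, and the labels simply restate the table.

```bash
# Map an mlsysim exit code to a short label, following the Exit Codes table.
describe_exit_code() {
  case "$1" in
    0) echo "success" ;;
    1) echo "bad-input" ;;
    2) echo "physics-violation" ;;
    3) echo "sla-violation" ;;
    *) echo "unknown" ;;
  esac
}

# Run any command, report how it ended on stderr, and pass its exit code through.
# (Hypothetical helper; works with any command, not only mlsysim.)
run_and_report() {
  code=0
  "$@" || code=$?
  echo "exit $code: $(describe_exit_code "$code")" >&2
  return "$code"
}

# Example (requires mlsysim on PATH):
#   run_and_report mlsysim eval Llama3_70B T4 --batch-size 1
```

Because the original exit code is returned unchanged, a CI step that calls `run_and_report` still fails or passes exactly as the bare command would.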

---

## Global Options

```
mlsysim [OPTIONS] COMMAND [ARGS]...
```

| Flag | Description | Default |
|:-----|:-----------|:--------|
| `-o, --output` | Output format: `text`, `json`, `markdown` | `text` |
| `--install-completion` | Install shell completion (bash, zsh, fish) | — |
| `--show-completion` | Print completion script to stdout | — |
| `--help` | Show help and exit | — |

---

## `mlsysim zoo`

Explore the built-in registries (the MLSys Zoo).

```
mlsysim zoo [CATEGORY]
```

**Arguments:**

| Argument | Description |
|:---------|:-----------|
| `CATEGORY` | `hardware` or `models` |

**Examples:**

```bash
# List all hardware in the Zoo with specs
mlsysim zoo hardware

# List all models with parameter counts and FLOPs
mlsysim zoo models

# JSON output for scripting
mlsysim zoo hardware -o json
```

---

## `mlsysim eval`

Evaluate the analytical physics of an ML system. This is the primary command — it runs the roofline analysis and returns bottleneck, latency, throughput, and memory usage.

```
mlsysim eval [OPTIONS] TARGET [HARDWARE]
```

**Arguments:**

| Argument | Description | Required |
|:---------|:-----------|:---------|
| `TARGET` | Model name (e.g., `Llama3_8B`), path to a custom model YAML, or path to a cluster `mlsys.yaml` | Yes |
| `HARDWARE` | Hardware name (e.g., `H100`) or path to a custom hardware YAML; required unless TARGET is a cluster `mlsys.yaml` | Conditional |

**Options:**

| Flag | Description | Default |
|:-----|:-----------|:--------|
| `-b, --batch-size` | Batch size | `1` |
| `-p, --precision` | Numerical precision: `fp32`, `fp16`, `fp8`, `int8`, `int4` | `fp16` |
| `-e, --efficiency` | Model FLOPs Utilization (0.0–1.0) | `0.5` |

**Examples:**

```bash
# Quick check: is ResNet-50 memory-bound on A100?
mlsysim eval ResNet50 A100

# LLM inference at batch 1 (typical serving scenario)
mlsysim eval Llama3_8B H100 --batch-size 1 --precision fp16

# Quantized inference
mlsysim eval Llama3_8B H100 --batch-size 32 --precision int8 --efficiency 0.35

# Full cluster evaluation with SLA assertions
mlsysim eval cluster.yaml

# JSON for CI/CD — fails with exit code 3 if SLA assertions fail
mlsysim eval cluster.yaml -o json
```
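
To see why `--precision` can decide whether a model fits at all, it helps to estimate the weight footprint by hand: parameter count times bytes per parameter. The sketch below does only that arithmetic. It ignores activations and KV cache, and it is not mlsysim's actual memory model. `bytes_per_param` and `weights_gb` are hypothetical helper names.

```bash
# Bytes per parameter for each value accepted by --precision.
bytes_per_param() {
  case "$1" in
    fp32) echo 4 ;;
    fp16) echo 2 ;;
    fp8|int8) echo 1 ;;
    int4) echo 0.5 ;;
    *) echo "unknown precision: $1" >&2; return 1 ;;
  esac
}

# Weight footprint in GB: parameters (in billions) x bytes per parameter.
weights_gb() {
  bpp=$(bytes_per_param "$2") || return 1
  awk -v p="$1" -v b="$bpp" 'BEGIN { printf "%.1f\n", p * b }'
}

weights_gb 70 fp16   # prints 140.0, far above a 16 GB T4
weights_gb 70 int4   # prints 35.0
weights_gb 8 fp16    # prints 16.0
```

The first call reproduces the 140 GB figure from the exit-code example: a 70B-parameter model at the default `fp16` needs about 140 GB for weights alone.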

### YAML Cluster Evaluation

When `TARGET` is a YAML file, `eval` runs the full 3-lens scorecard (Feasibility, Performance, Macro) including distributed training, economics, and sustainability analysis.

```yaml
version: "1.0"
workload:
  name: "Llama3_70B"
  batch_size: 4096
hardware:
  name: "H100"
  nodes: 64
ops:
  region: "Quebec"
  duration_days: 14.0
constraints:
  assert:
    - metric: "performance.latency"
      max: 50.0
```
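
When comparing cluster sizes, it can be convenient to generate the spec from a template rather than edit it by hand. The sketch below writes the example spec with a chosen node count. `write_cluster_yaml` is a hypothetical helper, and the nesting of the fields is an assumption inferred from the example; `mlsysim schema --type plan` is the authoritative source for the layout.

```bash
# Write the example cluster spec with a chosen node count.
# (Hypothetical helper; field names mirror the example spec, nesting is assumed.)
write_cluster_yaml() {
  nodes=$1
  out_file=$2
  cat > "$out_file" <<EOF
version: "1.0"
workload:
  name: "Llama3_70B"
  batch_size: 4096
hardware:
  name: "H100"
  nodes: $nodes
ops:
  region: "Quebec"
  duration_days: 14.0
constraints:
  assert:
    - metric: "performance.latency"
      max: 50.0
EOF
}

# Example (requires mlsysim on PATH):
#   for n in 32 64 128; do
#     write_cluster_yaml "$n" "cluster_$n.yaml" && mlsysim eval "cluster_$n.yaml" -o json
#   done
```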

---

## `mlsysim schema`

Export JSON Schema for configuration files. Feed these to your IDE for autocompletion or to an LLM agent for structured generation.

```
mlsysim schema [OPTIONS]
```

**Options:**

| Flag | Description |
|:-----|:-----------|
| `--type` | Schema type: `hardware`, `workload`, or `plan` |

**Examples:**

```bash
# Get the hardware YAML schema for IDE autocompletion
mlsysim schema --type hardware > hardware.schema.json

# Get the workload schema
mlsysim schema --type workload > workload.schema.json

# Get the full cluster plan schema (for mlsys.yaml files)
mlsysim schema --type plan > plan.schema.json
```
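
The three exports can be collected in one step, for instance when refreshing a `schemas/` directory that an editor is configured to read. This is a small sketch using only the documented `--type` values; `export_schemas` is a hypothetical helper name and the directory layout is an assumption.

```bash
# Export every documented schema type into a target directory.
# (Hypothetical helper, not an mlsysim command.)
export_schemas() {
  out_dir=${1:-.}
  mkdir -p "$out_dir" || return 1
  for schema_type in hardware workload plan; do
    mlsysim schema --type "$schema_type" > "$out_dir/$schema_type.schema.json" || return 1
  done
}

# Example (requires mlsysim on PATH):
#   export_schemas ./schemas
```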

---

## `mlsysim optimize`

Search the design space for optimal configurations. Each subcommand maps to an Optimizer in the 3-Tier architecture.

```
mlsysim optimize COMMAND [ARGS]...
```

### `mlsysim optimize parallelism`

Find the optimal (TP, PP, DP) split to maximize Model FLOPs Utilization.

```
mlsysim optimize parallelism CONFIG_FILE
```

| Argument | Description | Required |
|:---------|:-----------|:---------|
| `CONFIG_FILE` | Path to `mlsys.yaml` with fleet definition | Yes |

**Example:**

```bash
# Find the best parallelism strategy for a 70B model on 256 H100s
mlsysim optimize parallelism cluster.yaml
```

### `mlsysim optimize batching`

Find the maximum safe batch size that satisfies a P99 latency SLA.

```
mlsysim optimize batching [OPTIONS] CONFIG_FILE
```

| Flag | Description | Required |
|:-----|:-----------|:---------|
| `--sla-ms` | P99 latency SLA in milliseconds | Yes |
| `--qps` | Arrival rate in queries per second | Yes |

**Example:**

```bash
# Max batch size for 50ms P99 at 100 QPS
mlsysim optimize batching cluster.yaml --sla-ms 50 --qps 100
```
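
Since the answer depends on both the SLA and the arrival rate, it is often useful to run the optimizer over several SLA targets and compare the results. The sketch below loops over SLA values at a fixed QPS. `sweep_sla` is a hypothetical helper name; only `--sla-ms` and `--qps` come from the CLI.

```bash
# Run the batching optimizer once per SLA value at a fixed arrival rate.
# Usage: sweep_sla CONFIG_FILE QPS SLA_MS [SLA_MS...]
# (Hypothetical helper, not an mlsysim command.)
sweep_sla() {
  config=$1
  qps=$2
  shift 2
  for sla_ms in "$@"; do
    echo "== SLA ${sla_ms} ms at ${qps} QPS =="
    # Stop at the first failing run and pass its exit code through.
    mlsysim optimize batching "$config" --sla-ms "$sla_ms" --qps "$qps" || return $?
  done
}

# Example (requires mlsysim on PATH):
#   sweep_sla cluster.yaml 100 25 50 100
```

Stopping at the first failure is a design choice for this sketch; a sweep that should continue past a failing target could record the exit code and keep going instead.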

### `mlsysim optimize placement`

Find the optimal datacenter region to minimize TCO and carbon footprint.

```
mlsysim optimize placement [OPTIONS] CONFIG_FILE
```

| Flag | Description | Default |
|:-----|:-----------|:--------|
| `--carbon-tax` | Carbon tax penalty in $/ton CO₂ | `100.0` |

**Example:**

```bash
# Find cheapest region with $150/ton carbon penalty
mlsysim optimize placement cluster.yaml --carbon-tax 150
```

---

## `mlsysim audit`

Profile a workload against the Iron Law and report which wall binds.

```
mlsysim audit [OPTIONS]
```

| Flag | Description |
|:-----|:-----------|
| `--workload` | Workload name to audit |

---

## Bring Your Own YAML

Instead of using registry names, you can pass custom hardware or workload YAML files directly to `eval`:

```bash
# Custom chip spec against a Zoo model
mlsysim eval Llama3_8B ./my_custom_chip.yaml --batch-size 32

# Both custom
mlsysim eval ./my_model.yaml ./my_chip.yaml
```

See [Getting Started — Defining Custom Models](getting-started.qmd#defining-custom-models) for the model definition format.