


Tutorial Template — MLSys·im

This document defines the canonical structure for all MLSys·im tutorials. Every tutorial follows this template exactly. Consistency is pedagogical.

Design Principles

One tutorial = one core question. The title IS the question. Every cell exists to answer it.
Predict → Compute → Reflect. Before showing code output, tell the reader what to expect. After showing it, explain what it means. This is the learning cycle.
The "aha moment" is sacred. Every tutorial has exactly ONE insight that changes how the reader thinks. Frame it, build to it, land it in the Key Insight callout.
Real hardware, real models. No toy examples. Every tutorial uses published specs from the Zoo.
Code-first, prose-second. Keep prose tight. The insight comes from the numbers, not from reading paragraphs. Explanations serve the code, not the other way around.
Sub-30-second runtime. Every tutorial runs on a laptop in under 30 seconds. No GPU needed.
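
The runtime principle can be checked mechanically. The following is a minimal sketch, assuming a tutorial's work can be wrapped in a single callable; the helper name `timed` and the stand-in workload are hypothetical and not part of MLSys·im.

```python
import time

BUDGET_SECONDS = 30.0  # the runtime ceiling stated in the design principles


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload (hypothetical); a real check would run the tutorial's cells.
result, elapsed = timed(lambda: sum(i * i for i in range(100_000)))
print(f"elapsed: {elapsed:.3f} s (budget: {BUDGET_SECONDS:.0f} s)")
assert elapsed < BUDGET_SECONDS, "tutorial exceeds the laptop runtime budget"
```

Wall-clock time varies across machines, so a check like this works best with generous headroom below the 30-second limit.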

Canonical Structure

Every tutorial has exactly these sections, in this order:

---
title: "<Core Question as Statement>"
subtitle: "<One-sentence hook — what makes this surprising or important>"
description: "<2-sentence summary for search/SEO>"
---

## The Question                          ← 2-3 sentences framing WHY this matters
                                          No code. Pure motivation.

::: {.callout-note}
## Prerequisites
<What tutorials must be completed first. Link them.>
:::

::: {.callout-note}
## What You Will Learn
<Exactly 3–4 bullet points. Each starts with a verb. Measurable outcomes.>
:::

::: {.callout-tip}
## Background: <Concept Name>            ← Jargon explained for newcomers
<Define key terms needed for THIS tutorial. Keep it to one concept.>
:::

---

## 1. Setup                              ← Hidden import cell + visible 2-line import

## 2. <First Analysis Step>              ← Name describes what we DO, not what we learn
   - Code cell (3–8 lines, heavily commented)
   - Brief prose interpreting the output

## 3. <Second Analysis Step>             ← The sweep / comparison / composition
   - Code cell
   - Prose or callout interpreting the pattern

## 4. <The Reveal>                       ← Where the aha moment lands
   - Code cell showing the surprising result
   - KEY INSIGHT callout (see below)

::: {.callout-important}
## Key Insight
<1–3 sentences. The ONE thing the reader should remember from this tutorial.
This is the "tweet-length" summary. Bold the core claim.>
:::

## 5. <Extension / Composition>          ← OPTIONAL: chain another solver
   - Shows how this connects to the broader system

---

## Your Turn                             ← Always exactly 3 exercises

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.**
<Always the first exercise. Forces the reader to form a hypothesis before running code.
Structure: predict → run → compare → explain the gap.>

**Exercise 2: Change one variable.**
<Modify a single parameter and predict the effect. Builds intuition for sensitivity.>

**Exercise 3: Connect to another domain.**
<Use a different solver or compose solvers. Shows that walls interact.>

**Self-check:** <One quick mental calculation to verify understanding.>
:::

---

## Key Takeaways

::: {.callout-tip}
## Summary
<Exactly 3–5 bullet points. Each maps to a "What You Will Learn" objective.
Use the same verb structure. The reader should be able to check them off.>
:::

---

## Next Steps
<3–4 links to related tutorials, organized by domain cluster.
Format: **[Tutorial Name](link)** — one-sentence description of what it adds.>
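
Because the section order is fixed, it can be linted. The sketch below is one possible check, not an existing MLSys·im tool: the function name and heading list are assumptions derived from the template above, and only the fixed heading names are checked because the numbered analysis steps vary per tutorial.

```python
import re

# Fixed level-2 headings in template order. "Background:" is a prefix because
# the concept name after the colon differs per tutorial.
REQUIRED_HEADINGS = [
    "The Question",
    "Prerequisites",
    "What You Will Learn",
    "Background:",
    "1. Setup",
    "Key Insight",
    "Your Turn",
    "Exercises",
    "Key Takeaways",
    "Summary",
    "Next Steps",
]


def missing_or_misordered(qmd_text: str) -> list[str]:
    """Return required headings that are absent or appear out of order."""
    headings = re.findall(r"^##[ \t]+(.+?)[ \t]*$", qmd_text, flags=re.MULTILINE)
    problems = []
    cursor = 0  # index in `headings` where the next search starts
    for required in REQUIRED_HEADINGS:
        for i in range(cursor, len(headings)):
            if headings[i].startswith(required):
                cursor = i + 1
                break
        else:
            # Not found at or after the cursor: missing or out of order.
            problems.append(required)
    return problems
```

An empty return value means every fixed heading is present in template order; anything else names the headings to fix.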

Callout Usage (Strict Rules)

| Callout Type | Purpose | When to Use |
|---|---|---|
| `{.callout-note}` | Prerequisites, What You Will Learn | Factual, informational |
| `{.callout-tip}` | Background concepts, Summary/Takeaways | Helpful context |
| `{.callout-important}` | Key Insight (the aha moment) | Exactly ONCE per tutorial |
| `{.callout-caution}` | Exercises | Always in "Your Turn" section |
| `{.callout-warning}` | Common mistakes / pitfalls | Only when there's a real trap |
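
The "exactly once" rule for `{.callout-important}` is easy to verify automatically. This is a minimal sketch, assuming callouts open with a `::: {.callout-<type>}` line as in the template; both function names are hypothetical.

```python
import re
from collections import Counter


def callout_counts(qmd_text: str) -> Counter:
    """Count opening callout fences such as '::: {.callout-note}' by type."""
    kinds = re.findall(r"^:::[ \t]*\{\.callout-(\w+)", qmd_text, flags=re.MULTILINE)
    return Counter(kinds)


def key_insight_ok(qmd_text: str) -> bool:
    """True when the tutorial has exactly one {.callout-important} block."""
    return callout_counts(qmd_text)["important"] == 1
```

The counter also shows at a glance whether a tutorial overuses `{.callout-warning}`.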

Code Cell Guidelines

Hidden setup cell: Always first. Set `#| echo: false` and `#| output: false`. It uses the importlib path hack for development; the visible cell that follows shows the clean import that a pip-installed reader would use.
Visible cells: 3–8 lines each. Every line has a comment or is self-explanatory.
Output formatting: Use f-strings with aligned columns for tables. Always include units.
Sweeps: Use a simple `for` loop with a print header. No pandas/matplotlib unless essential.
Variable names: Use domain terms (`model`, `hardware`, `fleet`, `solver`, `result`).
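
The sweep and output-formatting rules above can be sketched in plain Python. The hardware numbers here are illustrative placeholders rather than Zoo values, and `attainable_tflops` is a hypothetical helper that applies the standard roofline bound; a real tutorial would pull published specs from the Zoo and call an MLSys·im solver instead.

```python
# Illustrative placeholder specs (assumed, not Zoo values).
PEAK_TFLOPS = 300.0  # peak compute, TFLOP/s
MEM_BW_TBPS = 2.0    # memory bandwidth, TB/s


def attainable_tflops(intensity_flop_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth x arithmetic intensity)."""
    return min(PEAK_TFLOPS, MEM_BW_TBPS * intensity_flop_per_byte)


# Print the header once, then one aligned row per sweep point, with units.
print(f"{'Intensity (FLOP/B)':>20} {'Attainable (TFLOP/s)':>22} {'Bound':>8}")
for intensity in [1, 10, 100, 1000]:
    perf = attainable_tflops(intensity)
    bound = "compute" if perf >= PEAK_TFLOPS else "memory"
    print(f"{intensity:>20} {perf:>22.1f} {bound:>8}")
```

Fixed-width format specifiers such as `:>22.1f` keep the columns aligned without pandas, and putting the units in the header keeps each row short.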

Naming Convention

Tutorials are numbered by cluster:

tutorials/
├── index.qmd                    ← Cluster-organized landing page
├── 00_hello_roofline.qmd        ← Cluster 0: Start Here
├── 01_memory_wall.qmd           ← Cluster 1: Node
├── 02_two_phases.qmd            ← Cluster 1: Node
├── 03_kv_cache.qmd              ← Cluster 1: Node
├── 04_starving_the_gpu.qmd      ← Cluster 2: Data
├── 05_quantization.qmd          ← Cluster 3: Algorithm
├── 06_scaling_1000_gpus.qmd     ← Cluster 4: Fleet
├── 07_geography.qmd             ← Cluster 5: Ops
├── 08_nine_million_dollar.qmd   ← Cluster 5: Ops
├── 09_sensitivity.qmd           ← Cluster 6: Analysis
├── 10_gpu_vs_wafer.qmd          ← Cluster 6: Analysis
├── 12_full_stack_audit.qmd      ← Cluster 7: Capstone
└── extending.qmd                ← Developer appendix (unnumbered)
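
A filename check can catch files that drift from this convention. The sketch below infers the pattern from the listing above (two-digit prefix, lowercase words joined by underscores, `.qmd` extension); the function name and the exact regular expression are assumptions.

```python
import re
from typing import Optional

# The two unnumbered files that appear in the listing.
UNNUMBERED = {"index.qmd", "extending.qmd"}
# Numbered tutorials, e.g. "03_kv_cache.qmd".
NUMBERED = re.compile(r"^(\d{2})_[a-z0-9]+(?:_[a-z0-9]+)*\.qmd$")


def tutorial_number(filename: str) -> Optional[int]:
    """Return the numeric prefix, or None for an allowed unnumbered file.

    Raises ValueError when the name fits neither pattern.
    """
    if filename in UNNUMBERED:
        return None
    match = NUMBERED.match(filename)
    if match is None:
        raise ValueError(f"unexpected tutorial filename: {filename!r}")
    return int(match.group(1))
```

Running this over the `tutorials/` directory and sorting the returned numbers would also reveal gaps or duplicates in the sequence.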

Domain Cluster Tags

Every tutorial's YAML front matter includes a categories field for cluster membership:

categories: ["node", "beginner"]       # Cluster 1, difficulty level
categories: ["fleet", "advanced"]      # Cluster 4, difficulty level
categories: ["capstone", "advanced"]   # Cluster 7, difficulty level

Valid clusters: start, node, data, algorithm, fleet, ops, analysis, capstone

Valid levels: beginner, intermediate, advanced
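
The valid values lend themselves to a small validator. This sketch assumes each `categories` list holds exactly one cluster and one level, as in the examples above; that rule and the function name are assumptions, not a documented MLSys·im check.

```python
VALID_CLUSTERS = {"start", "node", "data", "algorithm", "fleet", "ops", "analysis", "capstone"}
VALID_LEVELS = {"beginner", "intermediate", "advanced"}
ALLOWED = VALID_CLUSTERS | VALID_LEVELS


def check_categories(categories: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the field is valid."""
    clusters = [c for c in categories if c in VALID_CLUSTERS]
    levels = [c for c in categories if c in VALID_LEVELS]
    unknown = [c for c in categories if c not in ALLOWED]
    problems = []
    if len(clusters) != 1:  # assumed rule: exactly one cluster tag
        problems.append(f"expected exactly one cluster, found {clusters}")
    if len(levels) != 1:  # assumed rule: exactly one difficulty level
        problems.append(f"expected exactly one level, found {levels}")
    if unknown:
        problems.append(f"unknown categories: {unknown}")
    return problems
```

For example, `check_categories(["node", "beginner"])` returns an empty list, while a list with no level returns one problem describing what is missing.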
