cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-08 02:28:25 -05:00

Author	SHA1	Message	Date
Vijay Janapa Reddi	1eb30f5f86	fix(mlsysim): harden release QA and paper artifacts Align the MLSys·im code, docs, paper, website, workflows, and lab wheel for the 0.1.1 release. This also fixes runtime/API issues found during release review and prepares the paper PDF plus archive package.	2026-04-25 10:06:01 -04:00
Vijay Janapa Reddi	0745a5fb73	mlsysim: align package identity copy with paper title Four user-facing identity statements referenced an earlier working title ("A Composable Analytical Framework for Machine Learning Systems") that no longer matches the actual paper title fixed in 0.1.1 ("MLSys·im: First-Principles Infrastructure Modeling for Machine Learning Systems"). Align each identity-claiming statement to the paper title. This covers user-facing name-claims only — the places where mlsysim describes itself. Descriptive uses of "analytical framework" as a technical category inside the paper and related technical prose are retained (they situate mlsysim among other analytical tools like Paleo, Calculon, Vidur; those uses are legitimate). - mlsysim/pyproject.toml : project description - mlsysim/mlsysim/cli/main.py : `mlsysim --help` text - mlsysim/docs/tutorials/index.qmd: tutorial landing blurb - mlsysim/tutorial/prerequisites.md: prerequisites preamble	2026-04-24 15:59:03 -04:00
Vijay Janapa Reddi	3ba3858b74	MLSys·im 0.1.0 release-prep audit (#1397) * docs(mlsysim): release-prep audit fixes for 0.1.0 Fixes the broken links, stale numerical claims, and naming inconsistencies surfaced by the 0.1.0 release-prep review. Output of the docs site now matches what the engine actually computes, internal navigation has no unresolved targets, and the Hatch announcement banner uses an absolute URL so sub-pages render the "Get started" link correctly. Notable changes: - Hero example on docs/index.qmd and getting-started.qmd now reflect the actual Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843). - Update Python version requirement (3.10+) and document the editable-install limitation (Hatch sources rewrite is not supported by editables). - Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter metadata, and the shared cross-site dropdown. - Add the four solvers missing from the quartodoc list (BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer) and surface the orphan tutorials (01_pipeline_callbacks, 02_differential_explainer, 12_design_space_exploration) in the sidebar. - Rename every reference to the now-deleted hello_world / llm_serving / sustainability / 11_full_stack_audit tutorials to their current filenames. - Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd no longer logs a citeproc warning. - Fix the CLI sample on the parent site/index.qmd card to use real model identifiers (Llama3_70B H100 --batch-size 1). - Soften the Colab/Binder copy until launch buttons are wired in. - Remove the duplicate "Differential Explainer" card on tutorials/index.qmd. * release(mlsysim): add 0.1.0 release notes and runbook - RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG with install/quickstart copy and a "known limitations & gotchas" section covering the editable-install issue, broken example scripts, and unpublished slide tag. - RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check, tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release, and post-release verification). - CHANGELOG.md: corrected the test count from 334 to the actual 367 currently passing on dev. * mlsysim: nest package layout, enable editable installs, clean lint Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`) so `pip install -e .` works out of the box. The previous flat layout used a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the `editables` backend cannot handle, breaking editable installs entirely. Packaging - pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`, add explicit `[tool.hatch.build.targets.sdist]` include list. - Wheel and sdist now contain only the package and project metadata (no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage). - Update `pyright.exclude` for nested layout. - Update GitHub source links in `docs/math.qmd` and `docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`. Lint configuration - Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores: `__init__.py` re-export pattern (F401/F403/F405/F811), `core/constants.py` star import from unit registry, tests/examples idioms. - `ruff check .` reports zero issues (down from 621). Real bug fixes uncovered by lint cleanup - `core/solver.py`: remove unused `from pydantic import BaseModel` that was being shadowed by the local `BaseModel = ForwardModel` alias. - `sim/simulations.py`: remove redundant local `Fleet` import that was shadowing the module-level import and triggering F823 (referenced before assignment) on the earlier `isinstance(..., Fleet)` check. - `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare `except:` clauses to specific exception types. - `tests/test_sota.py`: add the missing speculative-decoding ITL assertion (`res_opt.itl < res_base.itl`); `res_base` was previously computed but never compared. - `cli/commands/eval.py`: drop unused `is_json` local. - `labs/components.py`: drop unused `energy` placeholder local. Examples - `examples/06_multi_objective_pareto.py`: rewrite around the actual `BatchingOptimizerResult` API (which has no `pareto_front` attribute); build the front explicitly by sweeping batch sizes through `ServingModel` + `TailLatencyModel`, then highlight the optimum returned by `BatchingOptimizer`. - `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors (`f"\n[…]"` instead of an embedded literal newline) so the file imports on every supported Python version. Dev scripts - `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch from package-relative imports to absolute `from mlsysim... import` so they run cleanly under the nested layout. Docs / release notes - `docs/getting-started.qmd`: replace the editable-install caveat with `pip install -e ".[dev]"` (now supported). - `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries that this commit resolves (editable install, pareto example, gemini example). - `CHANGELOG.md`: add a "Packaging & Tooling" section describing the layout change and the resolver bug fixes. Verification - `python -m pytest tests/` → 367 passed (was 367, no regressions). - `ruff check .` → All checks passed. - `pip install -e .` → succeeds; live source picked up. - Fresh-venv wheel install + CLI smoke test → succeeds. - `examples/06_multi_objective_pareto.py` and `examples/gemini_design_loop.py` → both exit 0. * fix(mlsysim): repair docs build + lab test after nested-package restructure The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/` to support `pip install -e .`. Two CI jobs still depended on the old layout: 1. Docs build (`mlsysim-preview-dev`): every tutorial and zoo page used a hand-rolled `importlib.util.spec_from_file_location` block to load `<repo>/mlsysim/__init__.py` directly from source. After the restructure, that path no longer exists. Replaced the hack in 17 docs/.qmd files with a plain `import mlsysim`; the package is already pip-installed in the docs build environment via `pip install ".[docs]"`. Updated the matching guidance in `contributing.qmd`. 2. Lab static tests: `test_no_localstorage_import` hard-coded `mlsysim/labs/state.py`; updated to the new nested path `mlsysim/mlsysim/labs/state.py`. Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation` passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.	2026-04-18 13:11:13 -04:00
Vijay Janapa Reddi	2e949f6574	fix(mlsysim): convert remaining dict accesses in distributed tutorial More result_dp["key"] → result_dp.key conversions in cells 4-6 (fabric comparison, pipeline sweep, and 3D parallelism tables). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 08:49:14 -04:00
Vijay Janapa Reddi	6ea8329a6c	fix(mlsysim): update distributed tutorial to use attribute access API The restored solver.py returns Pydantic result objects (attribute access) not dicts (subscript access). Fix result_dp["key"] → result_dp.key and node_performance → node_profile to match DistributedResult schema. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 08:44:06 -04:00
Vijay Janapa Reddi	611de228d9	fix(mlsysim): align docs with Model naming convention The solver.py refactoring renamed most solver classes from Solver to Model (e.g. DistributedSolver → DistributedModel). The docs still referenced the old names, causing the Quarto site build to fail with: ImportError: cannot import name 'DistributedSolver' from 'mlsysim' - Fix executable code cells in tutorials/distributed.qmd - Update non-executable code examples across 10 doc files - Rename 19 API reference files from *Solver.qmd to *Model.qmd - SensitivitySolver and SynthesisSolver retain their names (correct) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 08:39:11 -04:00
Vijay Janapa Reddi	54096efc64	Merge feature/mlsysim-v0.1.0 into dev	2026-04-08 19:58:56 -04:00
Vijay Janapa Reddi	73f2906a38	refactor(mlsysim): core refactor with provenance, DSE, and docs updates Remove pedagogy module, add provenance tracking and design space exploration. Update evaluation engine, pipeline callbacks, and documentation including new tutorials.	2026-03-21 08:31:34 -04:00
Vijay Janapa Reddi	3e76c7cad6	fix(mlsysim): resolve pyproject.toml conflict, clean up legacy models, and sync paper anchors	2026-03-18 14:25:50 -04:00
Vijay Janapa Reddi	46fdae75b0	docs: auto-format print statements to tables across tutorials	2026-03-13 09:05:17 -04:00
Vijay Janapa Reddi	6f973091e1	docs(mlsysim): refactor tutorial 01 to use mlsysim.show utilities	2026-03-13 08:50:36 -04:00
Vijay Janapa Reddi	2bbe3e1a69	docs(mlsysim): redesign website, add 12 tutorials, and CLI entry points Replace 9 old tutorials with 12 new numbered tutorials (00-11) covering roofline through full-stack audit. Redesign landing page, add models-and-solvers and extending-the-engine guides. Add __main__.py, cli.py, and cli/ package for command-line interface.	2026-03-12 16:04:51 -04:00
Vijay Janapa Reddi	5c52507f27	feat(mlsysim): add prompt caching to ServingSolver and release-readiness fixes Add cached_prefix_len parameter to ServingSolver for prefix/prompt caching (grounded in Zheng et al. SGLang/RadixAttention). TTFT reduces proportionally to cache hit ratio; ITL and memory unchanged. Export 4 missing solvers from __init__.py (ContinuousBatchingSolver, WeightStreamingSolver, TailLatencySolver, CheckpointSolver). Fix dict-style access in for-engineers.qmd and architecture_comparison tutorial. Add math sections 3.4-3.6 for prompt caching, disaggregated serving (Patel et al. Splitwise ISCA'24), and speculative decoding (Leviathan et al. ICML'23) with literature citations. Update paper.tex Wall 4 description to include prompt caching. Fix remaining MLSYSIM branding in _quarto-html.yml.	2026-03-12 16:04:51 -04:00
Vijay Janapa Reddi	1b32571af7	docs(mlsysim): harmonize website with paper and add 5 tutorials Website-paper consistency: - Rename Operations to Ops across architecture, glossary, solver-guide - Fix Mermaid diagram arrows for progressive lowering - Add extensibility section to architecture page - Add workload types table to getting-started and zoo/models - Add Binding Constraint and Systems Wall to glossary - Expand sidebar to list all 10 tutorials New tutorials covering all 6 paper domains: - design_space.qmd: bottleneck regime map (Node domain) - data_pipeline.qmd: CPU bottleneck analysis (Data domain) - cot_economics.qmd: inference cost scaling (Algorithm domain) - sensitivity.qmd: binding constraint audit (Analysis domain) - architecture_comparison.qmd: GPU vs Cerebras (Node domain) Persona page updates: - for-students: expanded learning path to 8 tutorials - for-instructors: expanded course integration to 7 weeks - for-engineers: added sensitivity and architecture links	2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi	d594b4abd0	docs(mlsysim): expand to 22-wall taxonomy with paper rewrite and overview figure Expand walls.py from 17 to 22 walls, adding Serving (4), Batching (5), Streaming (6), Tail Latency (7), and Checkpoint (19). Update paper.tex with rewritten abstract, concrete LLaMA-3 motivating example, competitive positioning against Calculon/ASTRA-sim/Vidur, and new overview figure. Rebrand docs and tutorials to match.	2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi	8db12f0ee4	refactor(mlsysim): rebrand MLSYSIM to MLSys·im across paper and website Update display name from MLSYSIM to MLSys·im (with interpunct) in paper title, website config, and all 18 QMD documentation pages. Technical name (imports, file paths) remains lowercase mlsysim. Paper subtitle updated to "First-Principles Infrastructure Modeling for Machine Learning Systems". Preserve explicit anchor ID for cross-referenced #extending-mlsysim heading.	2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi	289e018223	refactor(mlsysim): typed results, wall taxonomy, and engineering naming - Add typed Pydantic result models (Layer A) replacing dict returns - Add canonical Wall taxonomy registry (walls.py) as single source of truth - Add Pipeline composer (Layer C) for solver chaining with explain()/run() - Rename domains: Metabolism→Node, Skeleton→Data, Mind→Algorithm, World→Fleet, Meta→Analysis - Rename MetabolismSolver→EfficiencySolver and MetabolismResult→EfficiencyResult - Update all solver classes with walls tuple referencing canonical wall numbers - Convert all dict access patterns to typed attribute access across codebase	2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi	7b145803c3	docs(mlsysim): update API docs, tutorials, and whitepaper for new architecture Rewrite API reference pages to match domain subpackage structure. Add solver doc pages for CompressionSolver, DataSolver, OrchestrationSolver, and ScalingSolver. Update whitepaper, math reference, getting-started guide, and tutorial index. Add extending tutorial for custom solvers.	2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi	b5bfc415d4	docs(mlsysim): document Training State, Sweep API, and Topology features	2026-03-08 15:20:54 -04:00
Vijay Janapa Reddi	aed43c5b81	docs: clean up landing page and centralize math foundations - Elevate 5-Layer Progressive Lowering mental model to architecture.qmd - Clean up landing page copy to be a punchy one-liner - Re-render architecture composition diagram as SVG for reliability - Move math derivations out of tutorials and into math.qmd with citations - Add DGX Spark to Silicon Zoo	2026-03-07 18:37:06 -05:00
Vijay Janapa Reddi	a78f1bd8b0	feat(mlsysim): add documentation site, typed registries, and 6-solver core Complete MLSYSIM v0.1.0 implementation with: - Documentation website (Quarto): landing page with animated hero and capability carousel, 4 tutorials (hello world, LLM serving, distributed training, sustainability), hardware/model/fleet/infra catalogs, solver guide, whitepaper, math foundations, glossary, and full quartodoc API reference - Typed registry system: Hardware (18 devices across 5 tiers), Models (15 workloads), Systems (fleets, clusters, fabrics), Infrastructure (grid profiles, rack configs, datacenters) - Core types: Pint-backed Quantity, Metadata provenance tracking, custom exception hierarchy (OOMError, SLAViolation) - SimulationConfig with YAML/JSON loading and pre-validation - Scenario system tying workloads to systems with SLA constraints - Multi-level evaluation scorecard (feasibility, performance, macro) - Examples, tests, and Jetson Orin NX spec fix (100 → 25 TFLOP/s) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 15:59:51 -05:00
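
Commit 5c52507f27 above states that with prompt caching "TTFT reduces proportionally to cache hit ratio; ITL and memory unchanged." The sketch below is one possible reading of that sentence, not the ServingSolver implementation: it assumes the hit ratio is `cached_prefix_len / prompt_len` and that the uncached TTFT scales by `1 - hit_ratio`. The function name `ttft_with_prefix_cache` and the `base_ttft_ms` and `prompt_len` parameters are hypothetical; only `cached_prefix_len` is named in the log.

```python
def ttft_with_prefix_cache(base_ttft_ms: float, prompt_len: int,
                           cached_prefix_len: int) -> float:
    """Hypothetical helper: scale time-to-first-token by the uncached prompt share.

    Assumption (not confirmed by the log): hit ratio = cached_prefix_len / prompt_len,
    and TTFT shrinks linearly with it. ITL and memory are not modeled here because
    the commit message says they are unchanged.
    """
    if prompt_len <= 0:
        raise ValueError("prompt_len must be positive")
    if not 0 <= cached_prefix_len <= prompt_len:
        raise ValueError("cached_prefix_len must lie in [0, prompt_len]")
    hit_ratio = cached_prefix_len / prompt_len
    return base_ttft_ms * (1.0 - hit_ratio)


# Illustrative numbers only: 3072 of 4096 prompt tokens cached is a 75% hit ratio.
print(ttft_with_prefix_cache(200.0, 4096, 3072))  # 50.0
```

With no cached prefix the function returns the base TTFT unchanged, which matches the "proportional" wording at the zero end of the range.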

21 Commits
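
Commit 3ba3858b74 mentions a redundant local `Fleet` import that shadowed the module-level import and triggered ruff's F823 on an earlier `isinstance(..., Fleet)` check. The sketch below reproduces that class of bug with a standard-library name, since the actual `sim/simulations.py` code is not shown in this log; `Fraction` stands in for `Fleet`, and the functions `broken`, `fixed`, and `demo` are hypothetical.

```python
from fractions import Fraction  # stand-in for the module-level `Fleet` import


def broken(value):
    # The import further down makes `Fraction` a local name for the whole
    # function body, so this earlier reference is unbound when it executes.
    if isinstance(value, Fraction):
        return "fraction"
    from fractions import Fraction  # redundant local import (the F823 pattern)
    return "other"


def fixed(value):
    # With the local import removed, the module-level name is used.
    if isinstance(value, Fraction):
        return "fraction"
    return "other"


def demo():
    """Return the name of the exception raised by `broken`, or 'no error'."""
    try:
        broken(Fraction(1, 2))
    except UnboundLocalError as exc:
        return type(exc).__name__
    return "no error"


print(demo())                  # UnboundLocalError
print(fixed(Fraction(1, 2)))   # fraction
```

Python decides at compile time which names are local to a function, which is why the later import affects the earlier line and why a linter can flag it without running the code.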