Five PDFs in the source tree are pure build artifacts that CI
re-deploys at every run; the committed copies served no purpose
beyond local-preview convenience and accumulated as stale snapshots.
- mlsysim/docs/mlsysim-paper.pdf
CI overwrites it at every deploy: mlsysim-publish-live.yml copies
./pdf-artifacts/paper.pdf over MLSYSIM_DOCS/mlsysim-paper.pdf.
Local quarto preview now requires building the paper first
(cd mlsysim/paper && make).
- mlsysim/paper/figures/solver-chaining.pdf
- periodic-table/paper/figures/{mamba,molecular_ml,periodic_table_hero}.pdf
All are FORCE-regenerated from SVG by the per-paper Makefiles, whose
own comment states the rationale: "a stale committed PDF cannot mask
a freshly edited SVG."
Drop the matching ! whitelist entries from .gitignore so the global
*.pdf rule prevents accidental re-commit. Tutorial slide PDFs and
callout icons remain whitelisted; those are sources, not build outputs.
Note: tinytorch/quarto/assets/downloads/00_tinytorch.pdf is NOT
removed. Despite the slide-deck-like filename, no Beamer/Quarto
source exists for it and big-picture.qmd consumes it directly via
pdf.js viewer and download link. Treating it as a binary source
asset until a source is authored or LFS Phase 2 is set up.
Last leftover from the round-1 wrong-paper rename (Hoffmann is the
first author of the Chinchilla paper; Borgeaud is the second author).
5 cite sites in mlsysim/paper/paper.tex updated.
Verified: 0 orphan cites across all 5 paper subprojects + 2 textbook
volumes. bib_lint: 0 errors on all 7 bib files.
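The orphan check amounts to a two-way set difference between cite sites and
bib keys; a minimal sketch (illustrative helper, not the repo's bib_lint
tool):

```python
import re

def orphan_report(tex_text, bib_text):
    """Cross-check \\cite keys in .tex source against @entry keys in its
    .bib file, in both directions (illustrative, not the bib_lint tool)."""
    cited = {k.strip()
             for group in re.findall(r"\\cite[a-z]*\{([^}]*)\}", tex_text)
             for k in group.split(",")}
    defined = set(re.findall(r"@\w+\{([^,\s]+),", bib_text))
    return {"orphan_cites": cited - defined,     # cited, never defined
            "unused_entries": defined - cited}   # defined, never cited

report = orphan_report(
    r"SGLang \cite{zheng2024sglang} and DeepSeek-V3 \citep{deepseek2025v3}.",
    "@inproceedings{zheng2024sglang,\n  title = {SGLang},\n}\n",
)
```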
Round 2 of the bib audit, covering paper subprojects (mlsysim,
tinytorch, periodic-table, mlperf-edu) that the textbook-focused first
pass deferred. Same pattern as round 1: surname/year prefixes did not
match the entry's actual paper, plus several corrupt entries from
Crossref misidentification.
Renames:
- mlsysim/{docs,paper}: barrett2024 -> zheng2024sglang (SGLang paper,
Zheng is first author).
- mlsysim/paper: zhao2025 -> deepseek2025v3 (DeepSeek-V3 ISCA paper,
corporate author DeepSeek-AI).
- tinytorch: key499f5624 -> tanenbaum1987os (hash-fallback for
Tanenbaum OS textbook); fry1985 -> abelson1996sicp (SICP 2nd ed,
Fry is not in author list); wooster1982 -> papert1980mindstorms
(Mindstorms by Papert, Wooster not in author list); collins2018 ->
collins1989apprenticeship (Cognitive Apprenticeship paper is 1989).
- tinytorch + periodic-table: vaswani2025 -> vaswani2017attention
(Attention paper is 2017; entries had a corrupt publisher and bogus
DOI from Crossref misidentification).
Body fixes accompanying renames:
- tanenbaum1987os, abelson1996sicp, papert1980mindstorms: rebuilt as
@book entries (were @article with stale review/journal DOIs).
- vaswani2017attention: rebuilt with canonical NeurIPS 2017 metadata
(Curran Associates, vol 30, pp 5998-6008); dropped corrupt DOI.
Orphan deletions:
- tinytorch keybe9561f4 (hash-fallback, no cite sites).
- mlperf-edu vaswani2017attention (orphan).
21 cite-site updates across 4 paper subprojects. bib_lint reports 0
errors across all 5 modified bibs.
Per-file audit caught 14 cite keys whose surname prefix or year did not
match the entry's actual paper, plus 4 DOI duplicates and 3 corrupted
orphan entries. Renames preserve the cited paper; only the key changes.
Renames (key -> first-author-surname-year-shortform):
- vol2: agarwal2022 -> ouyang2022instructgpt; alistarh2024 ->
ashkboos2024quarot; belkada2022 -> dettmers2022llmint8; borgeaud2022 ->
hoffmann2022chinchilla; bosma2022 -> wei2022cot; ermon2023 ->
rafailov2023dpo; koyejo2023 -> schaeffer2023mirage; nofal2023 ->
beyer2016sre (year/publisher also corrected to O'Reilly 2016).
- vol1: mccarthy2006 -> mccarthy1955dartmouth; krizhevsky2017 ->
krizhevsky2012imagenet; zhang2021 -> zhang2017rethinking; ford2012 ->
savage2009flaw; wonyoung_kim2008 -> kim2008dvfs; estrada2026 ->
dehghani2022datamesh; michelucci2018 -> glorot2010xavier (entry was
Michelucci textbook chapter, prose wanted Glorot/Bengio AISTATS 2010);
chapelle2009 -> chapelle2006semisupervised (entry was 1-page IEEE
review, prose wanted the actual MIT Press book).
- interviews: key555befcd -> gierl2013automatic; chiang2023 ->
zheng2023judging; boylan1989 -> tay2024interview (Grind 75 web
resource); stenbeck1992 -> hambleton1991 (the entry was a 1992 review
of the 1991 IRT book; the cited content was the book itself).
DOI dedup:
- vol1 palmer1980 + palmer1980intel8087 -> palmer1980intel8087 (same
paper, redirected cite, deleted dupe).
- vol2 masanet2020 + masanet2020energy -> masanet2020energy (same paper,
redirected cite, deleted dupe).
- vol1 abadi2016tensorflow had wrong DOI pointing to the 2018 EuroSys
Dynamic Control Flow paper; rebuilt as the OSDI 2016 TensorFlow paper
it claims to be. Mirrored same correction into vol2's duplicate entry.
Orphan deletions (zero cite sites, corrupted metadata):
- vol1 acun2023; vol1 aggarwal2018; interviews gallifant2024 (the clean
GPT-4 entry already exists at openai2023gpt4).
- vol1 yu2018 (legitimate paper but unused).
- vol2 mckinsey2018ai and triton.jit (orphans flagged for missing year;
triton.jit was a false positive from a Python decorator inside a code
block, not a citation).
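The triton.jit false positive comes from scanning Quarto prose without
masking code listings; a sketch of a code-block-aware scan (helper name and
regexes are illustrative, not the actual audit tooling):

```python
import re

def find_cite_keys(qmd_text):
    """Collect Pandoc-style @citekey references from Quarto prose,
    stripping fenced code blocks first so a Python decorator such as
    @triton.jit inside a listing is not mistaken for a citation."""
    prose = re.sub(r"```.*?```", "", qmd_text, flags=re.DOTALL)
    return sorted(set(re.findall(r"@([A-Za-z][\w-]*\w)", prose)))

sample = (
    "Attention is all you need [@vaswani2017attention].\n"
    "```python\n@triton.jit\ndef kernel(x):\n    ...\n```\n"
)
keys = find_cite_keys(sample)
```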
Field repairs:
- aws2020s3: added year=2020, fixed corrupted author "A. W. Services"
to {Amazon Web Services}, added howpublished + url.
51 cite-site updates across 25 files in vol1/vol2/interviews/mlsysim.
All book-prose.md §5 cite-mechanics audit greps return zero hits.
bib_lint reports 0 errors across all three modified bibs.
Wraps up the bib-verify sweep across vol1, vol2, and the paper sub-projects,
and corrects three citation issues introduced earlier in the branch:
- Restore tang20211bit (1-bit Adam, Tang et al. ICML 2021) in vol2 bib and
in collective_communication.qmd. The earlier sweep had renamed the cite
to li2022, a duplicated key that resolved to either the AlphaCode or
the 1-Bit LAMB entry, neither of which is 1-bit Adam.
- Restore micikevicius2018mixed in vol1 bib to point at "Mixed Precision
Training" (Micikevicius et al. ICLR 2018). The entry had been overwritten
with an unrelated OpenSeq2Seq paper while the cite key stayed the same.
- Drop the unused li2022 (AlphaCode) entry and the duplicate li2022 (1-Bit
LAMB) entry from vol2 bib.
Also remove eight same-paper duplicate entries that the sweep had left
behind (vol1: lawson1979, gholami2022, lange2009, ribeiro2016; vol2:
bursztein2024, rasley2020, sevilla2022, narayanan2019).
After this commit the bibs have zero duplicate keys and zero orphan
citations across both volumes and all five paper sub-projects.
After web-checking MLPerf v0.7 results, Meta's Llama 3 parallelism
configuration, and Cerebras MemoryX specs, the previous edits
overstated what the public sources actually support.
- Anchor 1 (MLPerf v0.7 ResNet-50 DGX A100): the prior wording
asserted a specific ~50-minute time-to-train and a specific 38,200
samples/s reported figure, neither of which I could verify against
the MLPerf v0.7 results table (third-party comparisons cite ~28-29
minutes for 8x A100, which would imply a different sample rate).
Replace the over-precise claim with an order-of-magnitude validation
("aggregate training rates in the same regime as our prediction"),
and update tab:validation row 1 to "v0.7 same order" / "order-of-mag.".
- Anchor 3 (Llama 3 parallelism): drop the specific "DP=4 at 131K
context" qualifier. Meta published TP=8, PP=16, CP=16 for the long-
context phase; the 38-43% MFU range applies to the main pretraining,
which may use a different DP/CP. Keep only the dimensions
(TP=8, PP=16) that are unambiguously published for the 16K-H100 fleet.
- R1 case study (Cerebras MemoryX): replace "value reported in third-
party performance studies" (which I did not actually identify) with
"calibration estimate," since Cerebras has not published an official
MemoryX bandwidth figure.
No math or build changes. Page count unchanged at 29.
- Pass 14 (consolidation): the three Tier-N subsections in section 5
were each a single paragraph. Fold them into \paragraph{} blocks
under the section opener, leaving 5.1 Composition and 5.2 Scorecard
as the only \subsection structure. The opener now also stitches in
the cross-references that previously sat in a meta paragraph.
- Anchor 1 (MLPerf ResNet-50 round): change "MLPerf Training v4.0"
to "MLPerf Training v0.7" (matches the mlperf2020 citation year and
the era when 8-GPU DGX A100 was the canonical ResNet-50 entry; v4.0
was H100-dominated and has no comparable A100-only submission).
Reframe the 38,200 samples/s figure as a per-second throughput
inferred from the published time-to-target (~50 min over the 90-epoch
ImageNet schedule) rather than a directly reported samples/s metric.
- Anchor 5 (Chinchilla): use the actual training compute budget
C = 6 * 70B * 1.4T ~ 5.88e23 FLOPs instead of the rounded 5e23.
With the correct budget the solver predicts P* ~ 70.0B,
recovering the published 70B model size to <1%, instead of the
artificial 7.1% gap that came purely from rounding the input.
- Anchor 3 / Anchor 7 (Llama 3 405B parallelism): the previous
"TP=8, PP=4, DP=512" configuration is not what Meta published.
The Llama 3 paper and the ISCA'25 follow-up document TP=8,
CP=16, PP=16 (with DP varying by sequence length). Update
Anchor 3's fleet description to Meta's actual configuration and
rewrite Anchor 7 to claim only what is defensible: the optimizer
recovers the binding TP=8 intra-node constraint and the PP=16
memory-feasibility regime, not a bit-for-bit match including CP.
Update tab:validation row 7 from "0.0%" to "qualitative".
- R1 case study (Cerebras WSE-3): explicitly mark the 1.2 TB/s
MemoryX injection bandwidth as an assumption from third-party
studies, since Cerebras has not published an official figure.
Page count unchanged at 29.
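Both numeric corrections above are back-of-envelope computations. A sketch
(the ImageNet-1k train-set size, 90-epoch schedule, and the C = 6·N·D rule
are standard; the ~50-minute figure is the one quoted in the anchor text):

```python
# Anchor 1: aggregate samples/s inferred from time-to-target,
# not a directly reported throughput metric.
images_per_epoch = 1_281_167          # ImageNet-1k training set
epochs = 90                           # standard ResNet-50 schedule
minutes = 50                          # published time-to-target (approx.)
samples_per_sec = images_per_epoch * epochs / (minutes * 60)
print(f"{samples_per_sec:,.0f} samples/s")

# Anchor 5: Chinchilla training compute budget C = 6 * N * D.
N = 70e9                              # parameters
D = 1.4e12                            # training tokens
C = 6 * N * D                         # 5.88e23 FLOPs, not the rounded 5e23
print(f"C = {C:.2e} FLOPs")
```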
Apply the same editorial pass used on the StaffML paper:
- Pass 1 (US English): paper was already clean.
- Pass 2 (em-dashes): replace seven stylistic "---" in body text with
commas or parentheses; keep the "no dedicated wall" cell dash.
- Pass 3 (colon-elaborations): rewrite ~30 instances of the StaffML
"X: Y" pattern as separate sentences or commas, especially in the
R3 case-study walkthrough and the Fallacies section.
- Pass 4 (section previews): expand the openers of Architecture,
Taxonomy, 3-Tier Resolver Architecture, and Validation so each
multi-subsection section previews its subsections in prose.
- Pass 5 (footnote audit): inline the two terminology footnotes
about "node" and "single accelerator" into the body; keep the LP
and Mars Climate Orbiter substantive asides.
- Pass 8 (figure narrative): add a body reference and reading hint
for fig:solver-chaining, which previously had no in-text mention.
- Pass 9 (build hygiene): adopt the interviews/paper FORCE pattern
so figures/%.pdf is always regenerated from its SVG source, not
shadowed by a stale committed PDF; add make layout-review.
- Pass 11 (bibliography): drop a "Best Paper Award" note flag and
move an arXiv ID from a free-form note into a proper eprint field.
- Pass 13 (roadmap): rewrite the end-of-introduction roadmap so it
names every \section in order, including Architecture and
Conclusion (previously only their subsections were listed).
- Layout: wrap fig:carbon in \afterpage{...} so it lands on a fresh
page instead of being crammed into the same column as fig:roofline.
Page count: 28 -> 29.
Same regression as vol1/vol2 references.bib (commit 42bc54275 figure-audit
feat) — five auxiliary bib files (interviews/paper, mlsysim/docs,
mlsysim/paper, periodic-table/paper, tinytorch/paper) had brace patterns
mangled in titles, e.g. 'Throughput-Latency Tradeoff in {LLM} Inference'
became 'Throughput-Latency Tradeoff in {LLM}} Inference', which
bibtex-tidy refuses to parse.
Restored to the parent of 42bc54275 (state at 9ebdf77d0) and
re-formatted via the bib_apply_mechanical + bibtex-tidy hooks.
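The mangled-brace failure mode is a simple balance violation; a sketch of
the check that bibtex-tidy effectively enforces (helper name illustrative):

```python
def braces_balanced(field):
    """Sanity check for a BibTeX field value: unmatched braces (e.g. a
    title mangled from '{LLM}' to '{LLM}}') make bibtex-tidy refuse to
    parse the file."""
    depth = 0
    for ch in field:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

good = braces_balanced("Throughput-Latency Tradeoff in {LLM} Inference")
bad = braces_balanced("Throughput-Latency Tradeoff in {LLM}} Inference")
```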
Drop verbatim/near-duplicate lines and replace them with cross-references
so the story stays in one place:
- related-work close vs. C2
- validation vs. intro η/Roofline
- duplicate network-congestion bullet
- conclusion that restated the intro
Align the MLSys·im code, docs, paper, website, workflows, and lab wheel for the 0.1.1 release. This also fixes runtime/API issues found during release review and prepares the paper PDF plus archive package.
The wide table* for Table 1 (22 ML Systems Walls) was declared after the
Introduction's wrap-up paragraph, so LaTeX could only float it to the top
of page 4. Page 3 ended up with ~8 lines of orphaned text plus a ~85%
blank gap.
Move the table block to immediately follow its first citation paragraph.
LaTeX now places it at the top of page 3, and the remaining intro text
plus the opening of Section 2 (Related Work) fill the rest of the page.
Net effect: page 3 is full, and the paper is 29 pages instead of 30.
No prose changes; purely a source reorder.
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the
actual Engine.solve(ResNet50, A100, bs=1, fp16) output
(Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Rename every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
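The bare-except narrowing above follows the usual pattern; a minimal sketch
(hypothetical helper, not the repo's CLI code):

```python
def parse_metric(raw):
    """Catch only the failures we expect instead of a bare `except:`,
    which would also swallow KeyboardInterrupt and SystemExit
    (illustrative helper, not the actual audit/eval command code)."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None
```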
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
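Building the front explicitly, as the rewritten example does, reduces to a
non-domination filter over the swept points; a sketch independent of the
mlsysim API (helper name and the two-objective choice are illustrative):

```python
def pareto_front(points):
    """Non-dominated (throughput, latency) pairs: higher throughput and
    lower latency are both better (illustrative, not the mlsysim API)."""
    front = []
    for tp, lat in sorted(points, key=lambda p: (-p[0], p[1])):
        # Sorted by throughput desc, a point survives only if it beats
        # the best latency seen so far.
        if not front or lat < front[-1][1]:
            front.append((tp, lat))
    return front

sweep = [(100, 10), (90, 8), (80, 12), (100, 9)]  # e.g. one point per batch size
front = pareto_front(sweep)
```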
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs/.qmd files with
a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
Two bibliography fixes from the Pass 16 human-review backlog that had
unambiguous verification evidence from the Phase 2 parallel-agent sweep
and Crossref, so they did not require author judgment:
1. tinytorch/paper/references.bib: re-type tanenbaum1987minix
Entry was typed @article but the cited work is A. S. Tanenbaum's
1987 book "Operating Systems: Design and Implementation" published
by Prentice-Hall. The entry already had publisher and isbn fields
(added during the Pass 16 parallel-agent bib sweep); only the type
was wrong. Single-token fix: @article → @book.
2. mlsysim/paper/references.bib: fix zhang2024llmcompass DOI collision
The Phase 2 sweep (Agent F) detected that zhang2024llmcompass and
patel2024splitwise had the same DOI in the source bib
(10.1109/ISCA59077.2024.00060) — impossible since they are
different papers. Agent F verified Splitwise's correct DOI is
10.1109/ISCA59077.2024.00019 via IEEE Xplore and applied the
correction during the sweep.
However, zhang2024llmcompass was left with the original DOI
10.1109/ISCA59077.2024.00060 pending verification. Crossref
confirms that DOI belongs to "HEAP: A Fully Homomorphic Encryption
Accelerator with Parallelized Bootstrapping" by Agrawal et al.,
NOT LLMCompass.
Crossref returns the correct DOI for LLMCompass as
10.1109/ISCA59077.2024.00082
("LLMCompass: Enabling Efficient Hardware Design for Large
Language Model Inference").
This commit updates zhang2024llmcompass.doi to the verified
Crossref value.
Both files are now at 0 open bibliography-hygiene findings.
The 6 remaining bibliography-hygiene human-review items still in
the audit (3 fabricated entries + 3 wrong-author attributions) are
NOT touched by this commit — they require author judgment about
delete-vs-replace and re-attribution that only the author can make.
Parallel-agent bibliography verification sweep applied to the paper
bibliography files outside the book proper. These are academic papers
that live in the repo (mlsysim tutorial paper, tinytorch paper,
interviews paper, periodic-table paper) and were previously only subject
to bibtex-tidy formatting, not §5 hygiene validation.
Batches F and G of the Pass 16 parallel sweep processed 77 entries
total across 6 files; 73 auto-applied at HIGH+MEDIUM confidence.
Per-file summary:
mlsysim/paper/references.bib 50 entries applied (0 open)
mlsysim/docs/references.bib 15 entries applied (0 open)
tinytorch/paper/references.bib 7 entries applied (1 open)
interviews/paper/references.bib 3 entries applied (0 open)
periodic-table/paper/ref.bib 11 entries applied (0 open)
Each applied entry carries:
publisher or journal (primary field) + doi (when present on source)
+ x-verified = "2026-04-08"
+ x-verified-by = "pass-16-bib-sweep"
+ x-verified-source = <authoritative URL from DBLP, Crossref, arXiv, etc.>
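As an illustration of the resulting shape (the bibliographic fields below
are the well-known FlashAttention metadata, not a verbatim copy of any repo
entry; only the x-verified trailer format is taken from this commit):

```bibtex
@inproceedings{dao2022flashattention,
  title             = {FlashAttention: Fast and Memory-Efficient Exact
                       Attention with {IO}-Awareness},
  author            = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and
                       Rudra, Atri and R{\'e}, Christopher},
  booktitle         = {Advances in Neural Information Processing Systems},
  year              = {2022},
  x-verified        = {2026-04-08},
  x-verified-by     = {pass-16-bib-sweep},
  x-verified-source = {https://arxiv.org/abs/2205.14135}
}
```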
One open finding (intentional skip):
tanenbaum1987minix — typed @article but the actual publication is
A. S. Tanenbaum's 1987 book "Operating Systems: Design and
Implementation" (Prentice-Hall), not a journal article. The fix is
to re-type as @book, not fill a wrong `journal` field. Flagged for
a future type-refactor pass.
Cross-file duplicate keys are expected and correct: dao2022flashattention,
mattson2020mlperf, and vaswani2017attention each appear in multiple
paper .bib files because each paper independently cites these
foundational works. Each copy was verified and annotated separately.
This is the first pass that the repo-wide bib_lint + bibtex-tidy
pre-commit hooks have been applied to these paper .bib files.
Consistent layout for StaffML, mlsysim, and TinyTorch papers:
- figures/ for all visual assets (SVGs, PDFs, PNGs)
- scripts/ for utility scripts (analysis, validation, benchmarks)
- tables/ for standalone table .tex files (StaffML only)
- Makefile at root for building (created one for mlsysim)
Removed redundant build scripts (compile_paper.sh, build.sh) in
favor of Makefiles. Deleted sort_app_matrix.py (no longer needed).
Merged mlsysim images/ into figures/. Updated all references in
paper.tex, Makefiles, and CI workflows.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move efficiency coefficient (η) explanation from intro to §6.3
(Accuracy Scope and Limitations) where anchors are presented
- Add brief 3-sentence η bridge paragraph in intro with cross-ref
- Move Patterson/Hennessy MIPS analogy from dimensional correctness
paragraph to existing-tools paragraph where it fits naturally
- Fix chain equation (eq:chain) overflow with resizebox
- Each intro paragraph now carries exactly one point
- Add \label{sec:accuracy} for cross-referencing
Add validate_anchors.py that runs all 7 empirical anchors through mlsysim
solvers and compares output against paper.tex claims. Fix 4 mismatches:
- Anchor 1: correct η from 0.49 to 0.19 (ResNet-50 can't saturate tensor cores)
- Anchors 3/4: use system-level η (0.42/0.47) that captures stragglers,
checkpointing, and thermal throttling beyond analytical communication model
- Anchor 6: use Patterson's reported energy directly instead of TDP model
- Anchor 7: add memory feasibility check to ParallelismOptimizer so configs
where per-GPU weights+gradients exceed HBM are rejected
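The Anchor 7 feasibility check reduces to a per-GPU memory bound; a sketch
under loudly stated assumptions (18 B/param is an assumed mixed-precision
Adam footprint, and sharding only over TP×PP is a simplification; neither
is mlsysim's exact accounting):

```python
def memory_feasible(n_params, tp, pp, hbm_gb, bytes_per_param=18):
    """Reject parallelism configs whose per-GPU weights + gradients +
    optimizer state exceed HBM. bytes_per_param=18 assumes fp16
    weights/grads plus fp32 Adam state; sharding only over tp*pp is a
    simplification of the real optimizer's accounting."""
    per_gpu_bytes = n_params * bytes_per_param / (tp * pp)
    return per_gpu_bytes <= hbm_gb * 1e9

ok = memory_feasible(405e9, tp=8, pp=16, hbm_gb=80)       # ~57 GB: fits
too_big = memory_feasible(405e9, tp=8, pp=4, hbm_gb=80)   # ~228 GB: rejected
```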
Also convert all hardcoded inline numbers in paper.tex to pgfmath computed
constants derived from base hardware specs (single source of truth).
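The single-source-of-truth pattern can be sketched in LaTeX (macro names
are assumptions, not the paper's actual macros; \pgfmathsetmacro comes from
the pgf/tikz packages):

```latex
% Derive the inline number once from base specs instead of hardcoding it.
% Requires the pgfmath machinery (loaded by \usepackage{tikz} or pgf).
\newcommand{\ImagesPerEpoch}{1281167} % ImageNet-1k training set
\newcommand{\Epochs}{90}              % standard ResNet-50 schedule
\newcommand{\TrainMinutes}{50}
\pgfmathsetmacro{\SamplesPerSec}{\ImagesPerEpoch*\Epochs/(\TrainMinutes*60)}
% In prose: ... an aggregate rate of \SamplesPerSec samples/s ...
```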
Expand Introduction with urgency framing, reproducibility/equity angle,
target audience, and headline validation result. Remove bold run-in
headings, replace all em-dashes with proper punctuation, and fix AI
filler words throughout.
Rewrite mlsysim-overview.svg to use text-anchor positioning instead of
font-dependent transform+tspan offsets, fixing text overflow in all
domain boxes when rendered via rsvg-convert.
Add SVG-to-PDF auto-conversion step in build.sh so figures stay in sync.
- Add Anchor 7 to validate Optimizer convergence against Llama 3 strategy.
- Add Case Study R4 detailing automated parallelism search via Tier 3 Optimizer.
- Expand Section 5.3 to explicitly define how Optimizers span the 22 Walls taxonomy.
- Update Future Work to reframe multi-objective searches as Tier 3 Pareto Frontiers.
- Unify terminology globally: replace generic 'solvers' with 'resolvers' to respect the new 3-Tier semantics (Models, Solvers, Optimizers).
- Update Listing 2 comments to map directly to Layer A (Demand) and Layer D (Supply).
Add cached_prefix_len parameter to ServingSolver for prefix/prompt
caching (grounded in Zheng et al. SGLang/RadixAttention). TTFT reduces
proportionally to cache hit ratio; ITL and memory unchanged.
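The TTFT model described above is a one-line proportionality; a sketch
(function name and signature are illustrative, not ServingSolver's exact
API):

```python
def ttft_with_prefix_cache(ttft_uncached_ms, prompt_len, cached_prefix_len):
    """TTFT shrinks in proportion to the cached share of the prompt;
    ITL and memory are unaffected (sketch of the model described above,
    not mlsysim's exact signature)."""
    hit_ratio = cached_prefix_len / prompt_len
    return ttft_uncached_ms * (1.0 - hit_ratio)

# 75% of the prompt served from the radix cache -> 4x lower TTFT.
ttft = ttft_with_prefix_cache(120.0, prompt_len=4096, cached_prefix_len=3072)
```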
Export 4 missing solvers from __init__.py (ContinuousBatchingSolver,
WeightStreamingSolver, TailLatencySolver, CheckpointSolver).
Fix dict-style access in for-engineers.qmd and architecture_comparison
tutorial. Add math sections 3.4-3.6 for prompt caching, disaggregated
serving (Patel et al. Splitwise ISCA'24), and speculative decoding
(Leviathan et al. ICML'23) with literature citations. Update paper.tex
Wall 4 description to include prompt caching. Fix remaining MLSYSIM
branding in _quarto-html.yml.
- Fix solver-chaining PDF: re-export from SVG eliminating whitespace
- Add titlesec spacing to tighten section gaps
- Shorten wrapping headings for use cases
- Rename Ops section heading to fit one line
- Shrink figures and fix float placement
- Move Figure 1 after contributions for better flow
- Paper reduced from 23 to 22 pages
Expert reviewer read-through identified four flow issues:
- Line 267: "Operations" → "Ops" in Design Philosophy section
- Line 589: "The Operations walls" → "The Ops walls" in taxonomy
- Line 890: conclusion now says "21 solvers spanning 22 systems walls"
- Removed extra blank line before Architecture section
Paper: comprehensive formal tone review replacing informal/textbook
language with academic paper register throughout. Removes italic wall
hooks, rhetorical lists, conversational emphatics, and metaphorical
phrasing. Enriches wall descriptions with concrete numbers and
citations.
Code: add ZeRO/FSDP sharding, LoRA, activation recomputation,
compute/communication overlap, speculative decoding, and disaggregated
serving support across engine, solver, and model types. Add SOTA test
coverage.
Split Fleet domain into Fleet (multi-node coordination, W14-16) and
Operations (economics, sustainability, safety, W17-20). Add inline
citations for Walls 4, 8, 9, 10. Expand Wall 6 with compute-injection
overlap rationale and Cerebras citation. Add persona framing paragraphs
for use cases. Cut Lab Integration section, fold accessibility point
into conclusion. Expand appendix from 4 to all 21 solvers by domain.