54 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
d9e57cc736 Merge dev into chore/bib-verify-sweep (taking dev prose for conflicts) 2026-05-06 07:22:04 -04:00
Vijay Janapa Reddi
a955b8142f chore: untrack 5 build-output PDFs regenerated by Makefile/CI
Five PDFs in the source tree are pure build artifacts that CI
re-deploys at every run; the committed copies served no purpose
beyond local-preview convenience and accumulated as stale snapshots.

- mlsysim/docs/mlsysim-paper.pdf
  CI overwrites at deploy: mlsysim-publish-live.yml runs
  cp ./pdf-artifacts/paper.pdf to MLSYSIM_DOCS/mlsysim-paper.pdf.
  Local quarto preview now requires building the paper first
  (cd mlsysim/paper && make).
- mlsysim/paper/figures/solver-chaining.pdf
- periodic-table/paper/figures/{mamba,molecular_ml,periodic_table_hero}.pdf
  All FORCE-regenerated from SVG by the per-paper Makefiles whose
  own comment is the rationale: "a stale committed PDF cannot mask
  a freshly edited SVG."

Drop the matching ! whitelist entries from .gitignore so the global
*.pdf rule prevents accidental re-commit. Tutorial slide PDFs and
callout icons remain whitelisted; those are sources, not build outputs.

Note: tinytorch/quarto/assets/downloads/00_tinytorch.pdf is NOT
removed. Despite the slide-deck-like filename, no Beamer/Quarto
source exists for it and big-picture.qmd consumes it directly via
pdf.js viewer and download link. Treating it as a binary source
asset until a source is authored or LFS Phase 2 is set up.
2026-05-06 06:30:56 -04:00
Vijay Janapa Reddi
dc54039f6c fix(mlsysim): rename borgeaud2022 → hoffmann2022chinchilla in paper.tex
Last leftover from the round-1 wrong-paper rename (Hoffmann is first
author of the Chinchilla paper, Borgeaud is co-author #2). 5 cite
sites in mlsysim/paper/paper.tex updated.

Verified: 0 orphan cites across all 5 paper subprojects + 2 textbook
volumes. bib_lint: 0 errors on all 7 bib files.
2026-05-05 21:29:26 -04:00
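The "0 orphan cites" check mentioned above can be approximated with a few lines of Python. This is a minimal sketch, not the repository's `bib_lint`: the function names and regular expressions are hypothetical, and it assumes an "orphan cite" means a `\cite` key with no matching bib entry.

```python
import re

# Hypothetical patterns for illustration; real LaTeX/BibTeX parsing has more edge cases.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
NON_ENTRIES = {"comment", "string", "preamble"}


def cited_keys(tex: str) -> set:
    """Collect every key used in \\cite-style commands (\\cite, \\citep, \\citet, ...)."""
    keys = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set:
    """Collect every entry key defined in a .bib file."""
    return {key for kind, key in ENTRY_RE.findall(bib) if kind.lower() not in NON_ENTRIES}


def orphan_cites(tex: str, bib: str) -> set:
    """Keys that are cited but have no bib entry (assumed meaning of 'orphan cite')."""
    return cited_keys(tex) - bib_keys(bib)


def unused_entries(tex: str, bib: str) -> set:
    """Bib entries that no cite site references."""
    return bib_keys(bib) - cited_keys(tex)


tex = r"See \cite{hoffmann2022chinchilla, borgeaud2022} and \citep[p.~3]{vaswani2017attention}."
bib = "@article{hoffmann2022chinchilla,\n}\n@inproceedings{vaswani2017attention,\n}\n@book{unused1999,\n}"
print(orphan_cites(tex, bib), unused_entries(tex, bib))
```

In this toy input the stale `borgeaud2022` key is the only orphan cite, which is the kind of leftover this commit removes.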
Vijay Janapa Reddi
c3921491e8 chore(bib): fix paper-subproject wrong-paper keys and corrupt entries
Round 2 of the bib audit, covering paper subprojects (mlsysim,
tinytorch, periodic-table, mlperf-edu) that the textbook-focused first
pass deferred. Same pattern as round 1: surname/year prefixes did not
match the entry's actual paper, plus several corrupt entries from
Crossref misidentification.

Renames:
- mlsysim/{docs,paper}: barrett2024 -> zheng2024sglang (SGLang paper,
  Zheng is first author).
- mlsysim/paper: zhao2025 -> deepseek2025v3 (DeepSeek-V3 ISCA paper,
  corporate author DeepSeek-AI).
- tinytorch: key499f5624 -> tanenbaum1987os (hash-fallback for
  Tanenbaum OS textbook); fry1985 -> abelson1996sicp (SICP 2nd ed,
  Fry is not in author list); wooster1982 -> papert1980mindstorms
  (Mindstorms by Papert, Wooster not in author list); collins2018 ->
  collins1989apprenticeship (Cognitive Apprenticeship paper is 1989).
- tinytorch + periodic-table: vaswani2025 -> vaswani2017attention
  (Attention paper is 2017; entries had a corrupt publisher and bogus
  DOI from Crossref misidentification).

Body fixes accompanying renames:
- tanenbaum1987os, abelson1996sicp, papert1980mindstorms: rebuilt as
  @book entries (were @article with stale review/journal DOIs).
- vaswani2017attention: rebuilt with canonical NeurIPS 2017 metadata
  (Curran Associates, vol 30, pp 5998-6008); dropped corrupt DOI.

Orphan deletions:
- tinytorch keybe9561f4 (hash-fallback, no cite sites).
- mlperf-edu vaswani2017attention (orphan).

21 cite-site updates across 4 paper subprojects. bib_lint reports 0
errors across all 5 modified bibs.
2026-05-05 20:21:04 -04:00
Vijay Janapa Reddi
c0241d2f80 chore(bib): fix wrong-paper keys, DOI dupes, and corrupt entries
Per-file audit caught 14 cite keys whose surname prefix or year did not
match the entry's actual paper, plus 4 DOI duplicates and 3 corrupted
orphan entries. Renames preserve the cited paper; only the key changes.

Renames (key -> first-author-surname-year-shortform):
- vol2: agarwal2022 -> ouyang2022instructgpt; alistarh2024 ->
  ashkboos2024quarot; belkada2022 -> dettmers2022llmint8; borgeaud2022 ->
  hoffmann2022chinchilla; bosma2022 -> wei2022cot; ermon2023 ->
  rafailov2023dpo; koyejo2023 -> schaeffer2023mirage; nofal2023 ->
  beyer2016sre (year/publisher also corrected to O'Reilly 2016).
- vol1: mccarthy2006 -> mccarthy1955dartmouth; krizhevsky2017 ->
  krizhevsky2012imagenet; zhang2021 -> zhang2017rethinking; ford2012 ->
  savage2009flaw; wonyoung_kim2008 -> kim2008dvfs; estrada2026 ->
  dehghani2022datamesh; michelucci2018 -> glorot2010xavier (entry was
  Michelucci textbook chapter, prose wanted Glorot/Bengio AISTATS 2010);
  chapelle2009 -> chapelle2006semisupervised (entry was 1-page IEEE
  review, prose wanted the actual MIT Press book).
- interviews: key555befcd -> gierl2013automatic; chiang2023 ->
  zheng2023judging; boylan1989 -> tay2024interview (Grind 75 web
  resource); stenbeck1992 -> hambleton1991 (entry was 1992 review of the
  1991 IRT book, content was the book).

DOI dedup:
- vol1 palmer1980 + palmer1980intel8087 -> palmer1980intel8087 (same
  paper, redirected cite, deleted dupe).
- vol2 masanet2020 + masanet2020energy -> masanet2020energy (same paper,
  redirected cite, deleted dupe).
- vol1 abadi2016tensorflow had wrong DOI pointing to the 2018 EuroSys
  Dynamic Control Flow paper; rebuilt as the OSDI 2016 TensorFlow paper
  it claims to be. Mirrored same correction into vol2's duplicate entry.

Orphan deletions (zero cite sites, corrupted metadata):
- vol1 acun2023; vol1 aggarwal2018; interviews gallifant2024 (the clean
  GPT-4 entry already exists at openai2023gpt4).
- vol1 yu2018 (legitimate paper but unused).
- vol2 mckinsey2018ai and triton.jit (orphans flagged for missing year;
  triton.jit was a false positive from a Python decorator inside a code
  block, not a citation).

Field repairs:
- aws2020s3: added year=2020, fixed corrupted author "A. W. Services"
  to {Amazon Web Services}, added howpublished + url.

51 cite-site updates across 25 files in vol1/vol2/interviews/mlsysim.
All book-prose.md §5 cite-mechanics audit greps return zero hits.
bib_lint reports 0 errors across all three modified bibs.
2026-05-05 20:00:54 -04:00
Vijay Janapa Reddi
5f94bf3b20 chore: complete bib sweep and fix three citation bugs
Wraps up the bib-verify sweep across vol1, vol2, and the paper sub-projects,
and corrects three citation issues introduced earlier in the branch:

- Restore tang20211bit (1-bit Adam, Tang et al. ICML 2021) in vol2 bib and
  in collective_communication.qmd. The earlier sweep had renamed the cite
  to li2022, which then resolved ambiguously to either AlphaCode or 1-Bit LAMB.
- Restore micikevicius2018mixed in vol1 bib to point at "Mixed Precision
  Training" (Micikevicius et al. ICLR 2018). The entry had been overwritten
  with an unrelated OpenSeq2Seq paper while the cite key stayed the same.
- Drop the unused li2022 (AlphaCode) entry and the duplicate li2022 (1-Bit
  LAMB) entry from vol2 bib.

Also remove eight same-paper duplicate entries that the sweep had left
behind (vol1: lawson1979, gholami2022, lange2009, ribeiro2016; vol2:
bursztein2024, rasley2020, sevilla2022, narayanan2019).

After this commit the bibs have zero duplicate keys and zero orphan
citations across both volumes and all five paper sub-projects.
2026-05-04 21:22:07 -04:00
Vijay Janapa Reddi
ba2942f4f8 chore: sweep bibs to MIT Press expectations 2026-05-04 13:24:23 -04:00
Vijay Janapa Reddi
7261b56de0 fix(refs): round 3 phase 1a+1b — 107 cited bib fixes 2026-05-03 14:27:29 -04:00
Vijay Janapa Reddi
046c832534 fix(refs): round 2 — 66 more bib audit fixes (catastrophic + cleanup) 2026-05-03 13:36:51 -04:00
Vijay Janapa Reddi
a2fe5b0cb0 fix(refs): apply 54 bib audit fixes from verification pass 2026-05-03 13:15:53 -04:00
Vijay Janapa Reddi
ce38fb6e50 docs(mlsysim/paper): tone down three claims to match what is verifiable
After web-checking MLPerf v0.7 results, Meta's Llama 3 parallelism
configuration, and Cerebras MemoryX specs, the previous edits
overstated what the public sources actually support.

- Anchor 1 (MLPerf v0.7 ResNet-50 DGX A100): the prior wording
  asserted a specific ~50-minute time-to-train and a specific 38,200
  samples/s reported figure, neither of which I could verify against
  the MLPerf v0.7 results table (third-party comparisons cite ~28-29
  minutes for 8x A100, which would imply a different sample rate).
  Replace the over-precise claim with an order-of-magnitude validation
  ("aggregate training rates in the same regime as our prediction"),
  and update tab:validation row 1 to "v0.7 same order" / "order-of-mag.".

- Anchor 3 (Llama 3 parallelism): drop the specific "DP=4 at 131K
  context" qualifier. Meta published TP=8, PP=16, CP=16 for the long-
  context phase; the 38-43% MFU range applies to the main pretraining,
  which may use a different DP/CP. Keep only the dimensions
  (TP=8, PP=16) that are unambiguously published for the 16K-H100 fleet.

- R1 case study (Cerebras MemoryX): replace "value reported in third-
  party performance studies" (which I did not actually identify) with
  "calibration estimate," since Cerebras has not published an official
  MemoryX bandwidth figure.

No math or build changes. Page count unchanged at 29.
2026-04-27 18:09:44 -04:00
Vijay Janapa Reddi
73822a8e52 docs(mlsysim/paper): consolidate Tier 1/2/3 micro-subsections; correct Anchor 1 round
- Pass 14 (consolidation): the three Tier-N subsections in section 5
  were each a single paragraph. Fold them into \paragraph{} blocks
  under the section opener, leaving 5.1 Composition and 5.2 Scorecard
  as the only \subsection structure. The opener now also stitches in
  the cross-references that previously sat in a meta paragraph.

- Anchor 1 (MLPerf ResNet-50 round): change "MLPerf Training v4.0"
  to "MLPerf Training v0.7" (matches the mlperf2020 citation year and
  the era when 8-GPU DGX A100 was the canonical ResNet-50 entry; v4.0
  was H100-dominated and has no comparable A100-only submission).
  Reframe the 38,200 samples/s figure as a per-second throughput
  inferred from the published time-to-target (~50 min over the 90-epoch
  ImageNet schedule) rather than a directly reported samples/s metric.
2026-04-27 18:06:36 -04:00
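The inference described above (a samples/s figure derived from a time-to-target over a 90-epoch schedule) is plain arithmetic. The sketch below only checks that the two numbers in the message are mutually consistent; it says nothing about the MLPerf results themselves. The ImageNet training-set size is an assumption not stated in the commit, and the function name is hypothetical.

```python
# Assumption: ILSVRC-2012 training split size (not stated in the commit message).
IMAGENET_TRAIN_IMAGES = 1_281_167


def implied_samples_per_second(minutes: float, epochs: int = 90,
                               images_per_epoch: int = IMAGENET_TRAIN_IMAGES) -> float:
    """Average throughput implied by finishing `epochs` passes in `minutes`."""
    return epochs * images_per_epoch / (minutes * 60.0)


rate = implied_samples_per_second(50)
print(f"{rate:,.0f} samples/s")  # within about 1% of the 38,200 quoted in the message
```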
Vijay Janapa Reddi
36120e81fc fix(mlsysim/paper): correct three factual claims surfaced in accuracy review
- Anchor 5 (Chinchilla): use the actual training compute budget
  C = 6 * 70B * 1.4T ~ 5.88e23 FLOPs instead of the rounded 5e23.
  With the correct budget the solver predicts P* ~ 70.0B,
  recovering the published 70B model size to within 1%, instead of the
  artificial 7.1% gap that came purely from rounding the input.

- Anchor 3 / Anchor 7 (Llama 3 405B parallelism): the previous
  "TP=8, PP=4, DP=512" configuration is not what Meta published.
  The Llama 3 paper and the ISCA'25 follow-up document TP=8,
  CP=16, PP=16 (with DP varying by sequence length). Update
  Anchor 3's fleet description to Meta's actual configuration and
  rewrite Anchor 7 to claim only what is defensible: the optimizer
  recovers the binding TP=8 intra-node constraint and the PP=16
  memory-feasibility regime, not a bit-for-bit match including CP.
  Update tab:validation row 7 from "0.0%" to "qualitative".

- R1 case study (Cerebras WSE-3): explicitly mark the 1.2 TB/s
  MemoryX injection bandwidth as an assumption from third-party
  studies, since Cerebras has not published an official figure.

Page count unchanged at 29.
2026-04-27 18:03:50 -04:00
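The Anchor 5 budget above follows from the C = 6 * P * D approximation quoted in the message. The sketch below reproduces that arithmetic and, as a labeled assumption, inverts it with a fixed 20 tokens-per-parameter ratio. That ratio is a rule of thumb chosen for illustration; the commit does not say what rule the mlsysim solver applies, and the function names are hypothetical.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """C = 6 * P * D, the approximation used in the commit message."""
    return 6.0 * params * tokens


def compute_optimal_params(flops: float, tokens_per_param: float = 20.0) -> float:
    """Invert C = 6 * P * (r * P) for P, ASSUMING a fixed token/parameter ratio r."""
    return math.sqrt(flops / (6.0 * tokens_per_param))


budget = training_flops(70e9, 1.4e12)    # about 5.88e23 FLOPs
p_star = compute_optimal_params(budget)  # about 70e9 under the assumed ratio
print(f"C = {budget:.3g} FLOPs, P* = {p_star / 1e9:.1f}B")
```

Because 1.4T tokens over 70B parameters is exactly 20 tokens per parameter, the assumed ratio recovers 70B from the unrounded budget.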
Vijay Janapa Reddi
5f75894e2b docs(mlsysim/paper): editorial pass — em-dashes, colons, previews, layout
Apply the same editorial pass used on the StaffML paper:

- Pass 1 (US English): paper was already clean.
- Pass 2 (em-dashes): replace seven stylistic "---" in body text with
  commas or parentheses; keep the "no dedicated wall" cell dash.
- Pass 3 (colon-elaborations): rewrite ~30 instances of the StaffML
  "X: Y" pattern as separate sentences or commas, especially in the
  R3 case-study walkthrough and the Fallacies section.
- Pass 4 (section previews): expand the openers of Architecture,
  Taxonomy, 3-Tier Resolver Architecture, and Validation so each
  multi-subsection section previews its subsections in prose.
- Pass 5 (footnote audit): inline the two terminology footnotes
  about "node" and "single accelerator" into the body; keep the LP
  and Mars Climate Orbiter substantive asides.
- Pass 8 (figure narrative): add a body reference and reading hint
  for fig:solver-chaining, which previously had no in-text mention.
- Pass 9 (build hygiene): adopt the interviews/paper FORCE pattern
  so figures/%.pdf is always regenerated from its SVG source, not
  shadowed by a stale committed PDF; add make layout-review.
- Pass 11 (bibliography): drop a "Best Paper Award" note flag and
  move an arXiv ID from a free-form note into a proper eprint field.
- Pass 13 (roadmap): rewrite the end-of-introduction roadmap so it
  names every \section in order, including Architecture and
  Conclusion (previously only their subsections were listed).
- Layout: wrap fig:carbon in \afterpage{...} so it lands on a fresh
  page instead of being crammed into the same column as fig:roofline.

Page count: 28 -> 29.
2026-04-27 17:50:22 -04:00
Vijay Janapa Reddi
0289cdd561 fix(bib): restore auxiliary bib files affected by title-mangling
Same regression as vol1/vol2 references.bib (commit 42bc54275 figure-audit
feat) — five auxiliary bib files (interviews/paper, mlsysim/docs,
mlsysim/paper, periodic-table/paper, tinytorch/paper) had brace patterns
mangled in titles, e.g. 'Throughput-Latency Tradeoff in {LLM} Inference'
became 'Throughput-Latency Tradeoff in {LLM}} Inference', which
bibtex-tidy refuses to parse.

Restored to the parent of 42bc54275 (state at 9ebdf77d0) and
re-formatted via the bib_apply_mechanical + bibtex-tidy hooks.
2026-04-27 15:14:55 -04:00
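The `{LLM}}` regression described above is an unbalanced-brace error, which a small scanner can catch before bibtex-tidy does. This is an illustrative sketch with a hypothetical function name, not one of the repository's hooks.

```python
from typing import Optional


def brace_problem(text: str) -> Optional[str]:
    """Return a description of the first brace imbalance, or None if balanced.

    Backslash-escaped braces (\\{ and \\}) are skipped.
    """
    depth = 0
    escaped = False
    for pos, ch in enumerate(text):
        if escaped:
            escaped = False
            continue
        if ch == "\\":
            escaped = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return f"unmatched '}}' at offset {pos}"
    return f"{depth} unclosed '{{'" if depth else None


good = "Throughput-Latency Tradeoff in {LLM} Inference"
bad = "Throughput-Latency Tradeoff in {LLM}} Inference"
print(brace_problem(good), "|", brace_problem(bad))
```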
Vijay Janapa Reddi
42bc54275d feat: add multimodal figure audit automation script and README 2026-04-27 13:35:48 -04:00
Vijay Janapa Reddi
9ebdf77d0a Commit on references 2026-04-27 13:21:16 -04:00
Vijay Janapa Reddi
07d7dd4f07 docs(mlsysim): dedupe repeated prose in paper.tex
Drop verbatim/near-duplicate lines: related-work close vs C2, validation vs intro η/Roofline, duplicate network-congestion bullet, conclusion that restated intro. Replace with cross-references so the story stays in one place.
2026-04-26 16:11:38 -04:00
Vijay Janapa Reddi
1a3747e544 fix(bib): verify references for interviews, mlsysim, and periodic-table papers
- periodic-table: add seven missing @ entries cited in paper.tex; fix mlsys proceedings URL in ivanov2021data (unescaped path segment).

- mlsysim: add arXiv url fields, replace escaped underscores in NeurIPS/MLSys x-verified-source URLs, point MLSys 2024 entries at abstract pages.

- interviews: enrich mattson2020 (arXiv eprint, abstract URL) and unescape ETS x-verified-source.
2026-04-26 15:44:47 -04:00
Vijay Janapa Reddi
a610deec21 feat: verify and fix BibTeX for interviews and MLSysIM papers
- interviews/paper: SWE-bench/ETS/VLDB/NeurIPS/MMLU metadata, figure rebuild and corpus script updates
- mlsysim: Eisenman author list, ISCA x-verified-source DOIs, snell2025scaling, Narayanan/Pope/PaLM/MLPerf; docs/references.bib aligned with paper
2026-04-26 14:55:52 -04:00
Vijay Janapa Reddi
1eb30f5f86 fix(mlsysim): harden release QA and paper artifacts
Align the MLSys·im code, docs, paper, website, workflows, and lab wheel for the 0.1.1 release. This also fixes runtime/API issues found during release review and prepares the paper PDF plus archive package.
2026-04-25 10:06:01 -04:00
Vijay Janapa Reddi
c3d4392580 fix(mlsysim/paper): move 22-walls table earlier to eliminate page-3 gap
The wide table* for Table 1 (22 ML Systems Walls) was declared after the
Introduction's wrap-up paragraph, so LaTeX could only float it to the top
of page 4. Page 3 ended up with ~8 lines of orphaned text plus a ~85%
blank gap.

Move the table block to immediately follow its first citation paragraph.
LaTeX now places it at the top of page 3, and the remaining intro text
plus the opening of Section 2 (Related Work) fill the rest of the page.

Net effect: page 3 is full, and the paper is 29 pages instead of 30.
No prose changes — purely a source reorder.
2026-04-24 15:02:35 -04:00
Vijay Janapa Reddi
3ba3858b74 MLSys·im 0.1.0 release-prep audit (#1397)
* docs(mlsysim): release-prep audit fixes for 0.1.0

Fixes the broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. Output of the docs site now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.

Notable changes:
- Hero example on docs/index.qmd and getting-started.qmd now reflect the actual
  Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
  limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
  metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
  (BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
  and surface the orphan tutorials (01_pipeline_callbacks,
  02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Rename every reference to the now-deleted hello_world / llm_serving /
  sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
  no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
  identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.

* release(mlsysim): add 0.1.0 release notes and runbook

- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
  with install/quickstart copy and a "known limitations & gotchas" section
  covering the editable-install issue, broken example scripts, and unpublished
  slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
  tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
  and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
  passing on dev.

* mlsysim: nest package layout, enable editable installs, clean lint

Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.

Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
  add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
  (no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
  `docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.

Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
  `__init__.py` re-export pattern (F401/F403/F405/F811),
  `core/constants.py` star import from unit registry,
  tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).

Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
  was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
  shadowing the module-level import and triggering F823 (referenced
  before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
  `except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
  assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
  computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
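
The `Fleet` shadowing fix in the list above is an instance of a general Python scoping rule: an import anywhere inside a function makes that name local for the whole function, so an earlier reference fails. The sketch below reproduces the pattern with a standard-library class standing in for `Fleet`; the function names are hypothetical.

```python
from fractions import Fraction  # module-level import, standing in for Fleet


def broken_check(value) -> bool:
    # Fraction is treated as a local name here because of the import below,
    # so this line raises UnboundLocalError at call time (Ruff reports F823).
    result = isinstance(value, Fraction)
    from fractions import Fraction  # redundant local import that shadows the global
    return result


def fixed_check(value) -> bool:
    # With the local import removed, the module-level name is used.
    return isinstance(value, Fraction)


try:
    broken_check(Fraction(1, 2))
except UnboundLocalError as exc:
    print("broken_check failed:", type(exc).__name__)
print("fixed_check:", fixed_check(Fraction(1, 2)))
```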

Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
  `BatchingOptimizerResult` API (which has no `pareto_front` attribute);
  build the front explicitly by sweeping batch sizes through
  `ServingModel` + `TailLatencyModel`, then highlight the optimum
  returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
  (`f"\n[…]"` instead of an embedded literal newline) so the file imports
  on every supported Python version.

Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
  from package-relative imports to absolute `from mlsysim... import` so
  they run cleanly under the nested layout.

Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
  `pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
  that this commit resolves (editable install, pareto example, gemini
  example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
  layout change and the resolver bug fixes.

Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
  `examples/gemini_design_loop.py` → both exit 0.

* fix(mlsysim): repair docs build + lab test after nested-package restructure

The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:

1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
   a hand-rolled `importlib.util.spec_from_file_location` block to load
   `<repo>/mlsysim/__init__.py` directly from source. After the restructure,
   that path no longer exists. Replaced the hack in 17 docs/.qmd files with
   a plain `import mlsysim` — the package is already pip-installed in the
   docs build environment via `pip install ".[docs]"`. Updated the matching
   guidance in `contributing.qmd`.

2. **Lab static tests** — `test_no_localstorage_import` hard-coded
   `mlsysim/labs/state.py`; updated to the new nested path
   `mlsysim/mlsysim/labs/state.py`.

Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
2026-04-18 13:11:13 -04:00
Vijay Janapa Reddi
6734cacc13 Merge feat/mitpress-vol1-copyedit-r1: passes 16-19 + figure-audit pipeline
Brings MIT Press copyedit round 1 work from passes 16-19 into dev:
- pass 16: abbreviation first-use sweep + corrective closures; bib review
  of 8 flagged items (6 fabricated/wrong-author entries resolved, 2
  autonomously verifiable items closed)
- pass 17: move Von Neumann footnote per AU query; x-verify stamp applied
  to 1,203 bib entries across vol1/vol2 + 169 entries in paper/docs bibs;
  fix 30 grandfathered bib errors
- pass 18: 86 Gemini-flagged issues (percent in captions, em dashes,
  contractions)
- pass 19: above/below spatial refs + hyphen-range to en-dash sweep
- Pre-commit infrastructure: 5 new MIT Press style checks
- Figure-narrative audit pipeline: Gemini multimodal fact-check tool that
  produced the figure audit we are currently resolving chapter-by-chapter

No conflicts detected with current dev state.

# Conflicts:
#	book/quarto/contents/vol1/nn_architectures/nn_architectures.qmd
#	book/tools/bib_lint_baseline.json
#	interviews/paper/references.bib
#	periodic-table/paper/references.bib
2026-04-18 08:01:34 -04:00
Vijay Janapa Reddi
e0d64da7f9 fix(mlsysim): address reviewer feedback + improve landing page
Paper:
- Define "bind/binding" at first use with footnote
- Clarify Table 2 caption and accelerator terminology
- Rename "Progressive Lowering" to "Layered Input Stack"

Website:
- Remove decontextualized stats bar from hero
- Move interactive carousel right under hero tagline
- Reorder carousel slides for narrative flow
- Fix broken tutorial links on landing page
- Fix sidebar: tutorial 11 → 12, comment out missing Interactive Apps
2026-04-09 20:40:49 -04:00
Vijay Janapa Reddi
9ade073984 pass 17 bib: stamp x-verified on 169 entries in paper/docs bib files
interviews: 47/47, mlperf-edu: 3/3, mlsysim/docs: 40/40,
mlsysim/paper: 71/71, periodic-table: 25/25, tinytorch: 60/60.
All repo bib files now carry x-verified markers.
2026-04-09 12:09:05 -04:00
Vijay Janapa Reddi
afc78f7bbd pass 16 bib: close 2 human-review items I could verify autonomously
Two bibliography fixes from the Pass 16 human-review backlog that had
unambiguous verification evidence from the Phase 2 parallel-agent sweep
and Crossref, so they did not require author judgment:

1. tinytorch/paper/references.bib: re-type tanenbaum1987minix

   Entry was typed @article but the cited work is A. S. Tanenbaum's
   1987 book "Operating Systems: Design and Implementation" published
   by Prentice-Hall. The entry already had publisher and isbn fields
   (added during the Pass 16 parallel-agent bib sweep); only the type
   was wrong. One-character fix: @article → @book.

2. mlsysim/paper/references.bib: fix zhang2024llmcompass DOI collision

   The Phase 2 sweep (Agent F) detected that zhang2024llmcompass and
   patel2024splitwise had the same DOI in the source bib
   (10.1109/ISCA59077.2024.00060) — impossible since they are
   different papers. Agent F verified Splitwise's correct DOI is
   10.1109/ISCA59077.2024.00019 via IEEE Xplore and applied the
   correction during the sweep.

   However, zhang2024llmcompass was left with the original DOI
   10.1109/ISCA59077.2024.00060 pending verification. Crossref
   confirms that DOI belongs to "HEAP: A Fully Homomorphic Encryption
   Accelerator with Parallelized Bootstrapping" by Agrawal et al.,
   NOT LLMCompass.

   Crossref returns the correct DOI for LLMCompass as
   10.1109/ISCA59077.2024.00082
   ("LLMCompass: Enabling Efficient Hardware Design for Large
   Language Model Inference").

   This commit updates zhang2024llmcompass.doi to the verified
   Crossref value.

Both files are now at 0 open bibliography-hygiene findings.

The 6 remaining bibliography-hygiene human-review items still in
the audit (3 fabricated entries + 3 wrong-author attributions) are
NOT touched by this commit — they require author judgment about
delete-vs-replace and re-attribution that only the author can make.
2026-04-08 19:47:28 -04:00
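A DOI collision like the one above (two different keys carrying the same DOI) can be found mechanically. The following is a minimal sketch with hypothetical names and a deliberately simple regex-based parser; it is not the sweep tooling referred to in the message.

```python
import re
from collections import defaultdict

# Hypothetical, simplified patterns: one field per line, braces or quotes around the DOI.
ENTRY_START = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
DOI_FIELD = re.compile(r'\bdoi\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE)


def doi_collisions(bib: str) -> dict:
    """Map each DOI that appears under more than one key to the list of keys."""
    by_doi = defaultdict(list)
    starts = list(ENTRY_START.finditer(bib))
    for idx, match in enumerate(starts):
        end = starts[idx + 1].start() if idx + 1 < len(starts) else len(bib)
        doi = DOI_FIELD.search(bib[match.end():end])
        if doi:
            by_doi[doi.group(1).strip().lower()].append(match.group(1))
    return {doi: keys for doi, keys in by_doi.items() if len(keys) > 1}


# Pre-fix state described in the commit: both keys carried the same DOI.
bib = """
@inproceedings{zhang2024llmcompass,
  doi = {10.1109/ISCA59077.2024.00060},
}
@inproceedings{patel2024splitwise,
  doi = {10.1109/ISCA59077.2024.00060},
}
"""
print(doi_collisions(bib))
```

A collision only shows that at least one entry is wrong; deciding which DOI belongs to which paper still needs a lookup against an authoritative source, as the commit did with Crossref and IEEE Xplore.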
Vijay Janapa Reddi
fc75ec4932 paper hygiene: verify publisher/journal/doi across repo paper .bib files (73 entries)
Parallel-agent bibliography verification sweep applied to the paper
bibliography files outside the book proper. These are academic papers
that live in the repo (mlsysim tutorial paper, tinytorch paper,
interviews paper, periodic-table paper) and were previously only subject
to bibtex-tidy formatting, not §5 hygiene validation.

Batches F and G of the Pass 16 parallel sweep processed 77 entries
total across 6 files; 73 auto-applied at HIGH+MEDIUM confidence.

Per-file summary:
  mlsysim/paper/references.bib    50 entries applied (0 open)
  mlsysim/docs/references.bib     15 entries applied (0 open)
  tinytorch/paper/references.bib   7 entries applied (1 open)
  interviews/paper/references.bib  3 entries applied (0 open)
  periodic-table/paper/ref.bib    11 entries applied (0 open)

Each applied entry carries:
  publisher or journal (primary field) + doi (when present on source)
  + x-verified = "2026-04-08"
  + x-verified-by = "pass-16-bib-sweep"
  + x-verified-source = <authoritative URL from DBLP, Crossref, arXiv, etc.>

One open finding (intentional skip):
  tanenbaum1987minix — typed @article but the actual publication is
  A. S. Tanenbaum's 1987 book "Operating Systems: Design and
  Implementation" (Prentice-Hall), not a journal article. The fix is
  to re-type as @book, not fill a wrong `journal` field. Flagged for
  a future type-refactor pass.

Cross-file duplicate keys are expected and correct: dao2022flashattention,
mattson2020mlperf, and vaswani2017attention each appear in multiple
paper .bib files because each paper independently cites these
foundational works. Each copy was verified and annotated separately.

This is the first pass in which the repo-wide bib_lint + bibtex-tidy
pre-commit hooks have been applied to these paper .bib files.
2026-04-08 18:25:41 -04:00
Vijay Janapa Reddi
4c31251b39 refactor: standardize paper directory structure across all three papers
Consistent layout for StaffML, mlsysim, and TinyTorch papers:
  - figures/ for all visual assets (SVGs, PDFs, PNGs)
  - scripts/ for utility scripts (analysis, validation, benchmarks)
  - tables/ for standalone table .tex files (StaffML only)
  - Makefile at root for building (created one for mlsysim)

Removed redundant build scripts (compile_paper.sh, build.sh) in
favor of Makefiles. Deleted sort_app_matrix.py (no longer needed).
Merged mlsysim images/ into figures/. Updated all references in
paper.tex, Makefiles, and CI workflows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 15:57:55 -04:00
Vijay Janapa Reddi
efcc610c1f fix(paper): final consistency pass — A100 312 TFLOPS, new features, multi-vendor
- A100: 312 TFLOPS dense FP16 TC (624 with sparsity), eta=0.37, ~115 TFLOP/s achieved
- Added multi-vendor hardware registry mention (MI250X, Gaudi 2/3, Trainium 2, TPU v6)
- Added gradient accumulation, service_time_cv, goodput_ratio descriptions
- Updated future work section
- validate_anchors.py aligned with paper constants
2026-04-03 08:49:41 -04:00
Vijay Janapa Reddi
7b408bab63 fix(paper): final proofread — A100 sparse→dense throughout, eta 0.19→0.37
Critical: Paper referenced A100 FP16 as 312 TFLOPS (sparse with 2:4
structured sparsity). Code now uses 156 TFLOPS (dense). Updated:
- Anchor 1: eta from 0.19 to 0.37 (58/156 dense, not 58/312 sparse)
- Fallacy section: A100 "312" → "156 dense", FLOPS ratio 3.2x → 6.3x
- Accuracy section: eta reference updated
- validate_anchors.py: efficiency 0.19 → 0.37
- Fixed "three design principles" → "four" (miscounted)
- Removed duplicate setcounter declarations
2026-04-01 18:43:22 -04:00
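The eta change above is a ratio against a different peak. The sketch below reproduces the two ratios quoted in the message; it assumes 58 is the achieved throughput in TFLOP/s and takes the peak values exactly as the commit states them, without judging which peak is the right one. The function name is hypothetical.

```python
def efficiency(achieved_tflops: float, peak_tflops: float) -> float:
    """eta = achieved throughput / peak throughput."""
    return achieved_tflops / peak_tflops


ACHIEVED = 58.0  # assumption: achieved TFLOP/s behind the "58/156" in the message

eta_new = efficiency(ACHIEVED, 156.0)  # peak used after this commit
eta_old = efficiency(ACHIEVED, 312.0)  # peak used before this commit
print(f"eta: {eta_old:.2f} -> {eta_new:.2f}")
```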
Vijay Janapa Reddi
924d5897c5 fix(paper): correct batched decode equation — step-level W+B*KV, not W/B+KV 2026-04-01 18:17:58 -04:00
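One way to read the corrected equation is as memory traffic per decode step: the weights are read once per step and each of the B sequences reads its own KV cache. The sketch below follows that reading. The symbol meanings (W as weight bytes, KV as per-sequence KV-cache bytes, B as batch size), the memory-bound timing model, and all function names are assumptions, since the commit title gives only the formula.

```python
def decode_step_bytes(weight_bytes: float, kv_bytes_per_seq: float, batch: int) -> float:
    """Step-level traffic: W + B * KV (weights read once, KV read per sequence)."""
    return weight_bytes + batch * kv_bytes_per_seq


def decode_step_time_s(weight_bytes: float, kv_bytes_per_seq: float, batch: int,
                       mem_bandwidth_bytes_per_s: float) -> float:
    """Step latency under an assumed purely memory-bound model."""
    return decode_step_bytes(weight_bytes, kv_bytes_per_seq, batch) / mem_bandwidth_bytes_per_s


# Illustrative numbers only, not taken from the paper.
W, KV, B, BW = 140e9, 1e9, 8, 2e12
step = decode_step_bytes(W, KV, B)  # 148e9 bytes per step
per_token = step / B                # equals W / B + KV, the amortized per-token form
print(step, per_token, decode_step_time_s(W, KV, B, BW))
```

Under this reading, W/B + KV is the per-token amortized cost, so using it where a per-step quantity is expected would understate step latency by a factor of B.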
Vijay Janapa Reddi
2c78289c1c feat(mlsysim): paper alignment, test audit, code improvements for v0.1.0
Paper: align equations with code fixes (hierarchical AllReduce M/G,
TP activation communication, prefill attention O(S²), decode batch
amortization, embodied carbon, compression inference speedup).

Tests: fix self-referential empirical targets, add Phase 3 feature
tests (inference speedup, goodput ratio, amortization, embodied carbon),
rewrite test_engine.py with meaningful assertions.

Code: add mlsysim.solvers namespace, trim __init__.py exports, add
serve CLI command, Workload.lower() default, service_time_cv on
TailLatencyModel, embodied_carbon_kg on HardwareNode.
2026-04-01 17:15:07 -04:00
Vijay Janapa Reddi
3e76c7cad6 fix(mlsysim): resolve pyproject.toml conflict, clean up legacy models, and sync paper anchors 2026-03-18 14:25:50 -04:00
Vijay Janapa Reddi
a6a3f6890e docs(mlsysim): align paper code listings with updated hardware and model registries 2026-03-16 11:01:11 -04:00
Vijay Janapa Reddi
4630be8c5d docs(mlsysim): add philosophy of ballpark estimation to paper validation section 2026-03-16 10:20:18 -04:00
Vijay Janapa Reddi
2d0d2d40ee refactor(mlsysim): pedagogical hardening of unit checks, scorecard UI, and docs 2026-03-14 16:29:36 -04:00
Vijay Janapa Reddi
330ac24956 refactor(paper): tighten intro structure and relocate η discussion to validation
- Move efficiency coefficient (η) explanation from intro to §6.3
  (Accuracy Scope and Limitations) where anchors are presented
- Add brief 3-sentence η bridge paragraph in intro with cross-ref
- Move Patterson/Hennessy MIPS analogy from dimensional correctness
  paragraph to existing-tools paragraph where it fits naturally
- Fix chain equation (eq:chain) overflow with resizebox
- Each intro paragraph now carries exactly one point
- Add \label{sec:accuracy} for cross-referencing
2026-03-14 12:47:26 -04:00
Vijay Janapa Reddi
3d0f96007a feat(paper): add solver validation pipeline and align all 7 empirical anchors
Add validate_anchors.py that runs all 7 empirical anchors through mlsysim
solvers and compares output against paper.tex claims. Fix 4 mismatches:

- Anchor 1: correct η from 0.49 to 0.19 (ResNet-50 can't saturate tensor cores)
- Anchors 3/4: use system-level η (0.42/0.47) that captures stragglers,
  checkpointing, and thermal throttling beyond analytical communication model
- Anchor 6: use Patterson's reported energy directly instead of TDP model
- Anchor 7: add memory feasibility check to ParallelismOptimizer so configs
  where per-GPU weights+gradients exceed HBM are rejected

Also convert all hardcoded inline numbers in paper.tex to pgfmath computed
constants derived from base hardware specs (single source of truth).
2026-03-14 10:18:30 -04:00
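The Anchor 7 feasibility rule above (reject configurations where per-GPU weights plus gradients exceed HBM) can be sketched as follows. This is not the `ParallelismOptimizer` code; the function name, parameters, and the assumption that gradients occupy the same bytes as weights are all hypothetical.

```python
def fits_in_hbm(param_count: float, bytes_per_param: float,
                shard_degree: int, hbm_bytes: float) -> bool:
    """Return True if per-GPU weights + gradients fit in HBM."""
    if shard_degree < 1:
        raise ValueError("shard_degree must be >= 1")
    weights = param_count * bytes_per_param / shard_degree
    grads = weights  # assumption: gradients stored at the same precision as weights
    return weights + grads <= hbm_bytes


# Illustrative numbers only: 70e9 parameters, 2 bytes each, 80e9 bytes of HBM.
candidates = [1, 2, 4, 8]
feasible = [d for d in candidates if fits_in_hbm(70e9, 2, d, 80e9)]
print(feasible)
```

A search would apply a filter like this before ranking configurations, so infeasible ones never reach the objective.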
Vijay Janapa Reddi
eda525b063 fix(mlsysim): resolve text overlaps in roofline crossover figure
Move batch-size labels to the right of data points, reduce font sizes,
and reposition Compute-Bound/Ridge labels to eliminate all text collisions.
2026-03-13 18:31:18 -04:00
Vijay Janapa Reddi
98e3f8e62b docs(mlsysim): polish paper intro, fix SVG text alignment, add SVG auto-convert
Expand Introduction with urgency framing, reproducibility/equity angle,
target audience, and headline validation result. Remove bold run-in
headings, replace all em-dashes with proper punctuation, and fix AI
filler words throughout.

Rewrite mlsysim-overview.svg to use text-anchor positioning instead of
font-dependent transform+tspan offsets, fixing text overflow in all
domain boxes when rendered via rsvg-convert.

Add SVG-to-PDF auto-conversion step in build.sh so figures stay in sync.
2026-03-13 18:28:04 -04:00
Vijay Janapa Reddi
3b387fa7d3 docs(mlsysim): integrate 3-Tier resolver architecture into paper
- Add Anchor 7 to validate Optimizer convergence against Llama 3 strategy.
- Add Case Study R4 detailing automated parallelism search via Tier 3 Optimizer.
- Expand Section 5.3 to explicitly define how Optimizers span the 22 Walls taxonomy.
- Update Future Work to reframe multi-objective searches as Tier 3 Pareto Frontiers.
- Unify terminology globally: replace generic 'solvers' with 'resolvers' to respect the new 3-Tier semantics (Models, Solvers, Optimizers).
- Update Listing 2 comments to map directly to Layer A (Demand) and Layer D (Supply).
2026-03-13 12:35:16 -04:00
Vijay Janapa Reddi
d8a8047bde docs(mlsysim): update paper, SVG figures, and bibliography
Tighten paper layout, add architecture-stack SVG, update all figure
SVGs for consistency, expand references, and add related work guide.
2026-03-12 16:04:51 -04:00
Vijay Janapa Reddi
5c52507f27 feat(mlsysim): add prompt caching to ServingSolver and release-readiness fixes
Add cached_prefix_len parameter to ServingSolver for prefix/prompt
caching (grounded in Zheng et al. SGLang/RadixAttention). TTFT reduces
proportionally to cache hit ratio; ITL and memory unchanged.

Export 4 missing solvers from __init__.py (ContinuousBatchingSolver,
WeightStreamingSolver, TailLatencySolver, CheckpointSolver).

Fix dict-style access in for-engineers.qmd and architecture_comparison
tutorial. Add math sections 3.4-3.6 for prompt caching, disaggregated
serving (Patel et al. Splitwise ISCA'24), and speculative decoding
(Leviathan et al. ICML'23) with literature citations. Update paper.tex
Wall 4 description to include prompt caching. Fix remaining MLSYSIM
branding in _quarto-html.yml.
2026-03-12 16:04:51 -04:00
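The prompt-caching behaviour described above (TTFT reduced in proportion to the cache hit ratio) can be illustrated with a minimal sketch. Only `cached_prefix_len` is named in the commit; the function, its other parameters, and the definition of hit ratio as cached prefix length over prompt length are assumptions, not the `ServingSolver` implementation.

```python
def ttft_with_prefix_cache(ttft_uncached_s: float, prompt_len: int,
                           cached_prefix_len: int) -> float:
    """Scale time-to-first-token by the fraction of the prompt that still needs prefill."""
    if prompt_len <= 0:
        raise ValueError("prompt_len must be positive")
    if not 0 <= cached_prefix_len <= prompt_len:
        raise ValueError("cached_prefix_len must be in [0, prompt_len]")
    hit_ratio = cached_prefix_len / prompt_len  # assumed definition of hit ratio
    return ttft_uncached_s * (1.0 - hit_ratio)


# Illustrative numbers only.
ttft_cold = ttft_with_prefix_cache(0.4, prompt_len=1000, cached_prefix_len=0)
ttft_warm = ttft_with_prefix_cache(0.4, prompt_len=1000, cached_prefix_len=750)
print(ttft_cold, ttft_warm)
```

In this sketch inter-token latency and memory are left out entirely, matching the commit's statement that they are unchanged.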
Vijay Janapa Reddi
e38c7b9af8 docs(mlsysim): add SVG figures for paper and newsletter 2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi
cae8a9a503 fix(mlsysim): tighten paper layout and fix figure sizing
- Fix solver-chaining PDF: re-export from SVG eliminating whitespace
- Add titlesec spacing to tighten section gaps
- Shorten wrapping headings for use cases
- Rename Ops section heading to fit one line
- Shrink figures and fix float placement
- Move Figure 1 after contributions for better flow
- Paper reduced from 23 to 22 pages
2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi
5aa518f8c1 fix(mlsysim): harmonize Ops terminology and clarify wall/solver count in paper
Expert reviewer read-through identified four flow issues:
- Line 267: "Operations" → "Ops" in Design Philosophy section
- Line 589: "The Operations walls" → "The Ops walls" in taxonomy
- Line 890: conclusion now says "21 solvers spanning 22 systems walls"
- Removed extra blank line before Architecture section
2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi
707920f92c docs(mlsysim): formal tone audit and SOTA solver features
Paper: comprehensive formal tone review replacing informal/textbook
language with academic paper register throughout. Removes italic wall
hooks, rhetorical lists, conversational emphatics, and metaphorical
phrasing. Enriches wall descriptions with concrete numbers and
citations.

Code: add ZeRO/FSDP sharding, LoRA, activation recomputation,
compute/communication overlap, speculative decoding, and disaggregated
serving support across engine, solver, and model types. Add SOTA test
coverage.
2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi
25f5ca99a3 docs(mlsysim): improve paper structure and complete quartodoc config
Paper improvements:
- Fix Cerebras Wall 6 reference: bare text → \citealt{lie2023cerebras}
- Move 22-wall summary table earlier (before wall descriptions, not after)
- Introduce MLSys Zoo terminology (Silicon/Model/Fleet/Infrastructure Zoos)
- Add appendix bridge sentence in solver formalism section

Quartodoc: add all 7 missing solvers to config, ordered by wall number
(EfficiencySolver, TransformationSolver, TopologySolver, InferenceScalingSolver,
ResponsibleEngineeringSolver, SensitivitySolver, SynthesisSolver)
2026-03-12 16:04:50 -04:00
Vijay Janapa Reddi
60506faa18 docs(mlsysim): restructure paper for 6-domain taxonomy and expand appendix
Split Fleet domain into Fleet (multi-node coordination, W14-16) and
Operations (economics, sustainability, safety, W17-20). Add inline
citations for Walls 4, 8, 9, 10. Expand Wall 6 with compute-injection
overlap rationale and Cerebras citation. Add persona framing paragraphs
for use cases. Cut Lab Integration section, fold accessibility point
into conclusion. Expand appendix from 4 to all 21 solvers by domain.
2026-03-12 16:04:50 -04:00