mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-10 15:49:25 -05:00
Three deliverables in one commit because they are co-dependent:
1. bib_lint.py — repo-wide BibTeX parser, validator, and formatter
New tool at book/tools/bib_lint.py. A stateful parser (not regex-based)
that handles nested braces in titles (LaTeX accents such as \'{e}, \^{o},
\"{o}), double-quoted vs. brace-delimited field values, biblatex
date-vs-year equivalence, and the long author lists common in
multi-author ML papers (Theano, Habitat, etc.).
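To illustrate why a stateful scan copes with nested braces where a regex tends not to, here is a minimal sketch of reading one field value by tracking brace depth. The helper name read_field_value is hypothetical and is not taken from bib_lint.py; the sketch counts braces literally and does not treat a backslash as an escape.

```python
def read_field_value(text: str, pos: int) -> tuple[str, int]:
    """Read one BibTeX field value that starts at text[pos].

    Returns (value without its outer delimiters, index just past the value).
    Hypothetical helper for illustration, not the bib_lint.py API.
    """
    opener = text[pos]
    if opener not in '{"':
        # Bare value such as a number or a @string macro name.
        end = pos
        while end < len(text) and text[end] not in ",}\n":
            end += 1
        return text[pos:end].strip(), end

    depth = 0
    for i in range(pos, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            # A brace-delimited value ends when its opening brace is matched.
            if opener == "{" and depth == 0:
                return text[pos + 1:i], i + 1
        elif ch == '"' and opener == '"' and depth == 0 and i > pos:
            # A quoted value ends at the first quote outside any braces.
            return text[pos + 1:i], i + 1
    raise ValueError(f"unterminated field value starting at offset {pos}")
```

Called on the text {Caf\'{e} {GPU}s}, the scan returns the inner title with its accent and protective braces intact, because the closing brace is only accepted once the depth returns to zero.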
Enforces §5 Bibliography Hygiene rules:
- Required fields per entry type (publisher, journal, booktitle,
year, author, title, institution, school as applicable)
- Forbidden fields dropped (organization, address — per MIT
Press round 1 cleanup)
- Journal names spelled out (detect J. Mach. Learn. Res. patterns)
- Author list rules (no et al., no em-dash shorthand, warn on
initial-only first names)
- Pages format (require --, not single hyphen)
- DOI format (bare, no https:// prefix)
- x-verified ISO-8601 date validation
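A few of the format rules above can be sketched as simple per-field checks. This is an illustration only: the function name check_fields, the message wording, and the abbreviation heuristic are assumptions, not the rules as implemented in bib_lint.py.

```python
import re
from datetime import date

# Heuristic (assumption): a capitalised word ending in a period, e.g. "J." or "Mach."
ABBREVIATED = re.compile(r"\b[A-Z][a-z]*\.(?:\s|$)")
# A hyphen that is not part of a "--" pair.
SINGLE_HYPHEN = re.compile(r"(?<!-)-(?!-)")
ET_AL = re.compile(r"\bet\s+al\b", re.IGNORECASE)


def is_iso_date(value: str) -> bool:
    """Accept only YYYY-MM-DD strings that name a real calendar date."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_fields(fields: dict[str, str]) -> list[str]:
    """Return human-readable problems for one entry's fields."""
    problems = []
    if SINGLE_HYPHEN.search(fields.get("pages", "")):
        problems.append("pages: use '--' between page numbers")
    if fields.get("doi", "").lower().startswith(("http://", "https://")):
        problems.append("doi: store the bare DOI without a URL prefix")
    if "x-verified" in fields and not is_iso_date(fields["x-verified"]):
        problems.append("x-verified: expected an ISO-8601 date (YYYY-MM-DD)")
    if ET_AL.search(fields.get("author", "")):
        problems.append("author: list every author instead of 'et al.'")
    if ABBREVIATED.search(fields.get("journal", "")):
        problems.append("journal: spell out the journal name")
    return problems
```

An entry with pages = {41--42}, a bare DOI, and x-verified = {2026-04-08} yields an empty list; pages = {41-42} or a DOI beginning with https:// each produce one problem.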
Provides bib_lint.apply_fields() as the SAFE alternative to
regex-based field insertion. Used by the parallel-agent sweep
apply pipeline to insert verified metadata without risk of
mangling titles containing braces.
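The idea behind brace-safe insertion can be sketched as follows: locate the brace that closes the entry by counting depth, then add the new fields just before it. The function insert_fields below is a hypothetical stand-in written for illustration; it is not the implementation or signature of bib_lint.apply_fields(), and it does not handle updating a field that already exists.

```python
def insert_fields(entry: str, new_fields: dict[str, str]) -> str:
    """Insert fields before the brace that closes a single BibTeX entry.

    Hypothetical sketch: assumes `entry` holds exactly one entry and that
    none of the new field names are already present.
    """
    start = entry.index("{")
    depth = 0
    for i in range(start, len(entry)):
        if entry[i] == "{":
            depth += 1
        elif entry[i] == "}":
            depth -= 1
            if depth == 0:
                # entry[i] is the entry's own closing brace, however many
                # braces the title or other values contain.
                body = entry[:i].rstrip()
                if not body.endswith(","):
                    body += ","
                added = "".join(
                    f"\n  {name} = {{{value}}}," for name, value in new_fields.items()
                )
                return body + added + "\n" + entry[i:]
    raise ValueError("entry has unbalanced braces")
```

Because the insertion point comes from depth counting rather than from matching the first or last brace on a line, a title such as {A {GPU} Study} passes through unchanged.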
CLI modes: --check (validate, exit non-zero on new errors),
--fix (rewrite to canonical form), --report (detailed violations,
default), --baseline (regenerate the grandfather allow-list).
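One way the four modes and the exit status could fit together is sketched below with argparse. Only the flag names and the "report is the default" and "check exits non-zero on new errors" behaviours come from the description above; build_parser, exit_status, and the positional paths argument are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with four mutually exclusive mode flags (sketch)."""
    parser = argparse.ArgumentParser(prog="bib_lint")
    mode = parser.add_mutually_exclusive_group()
    for flag in ("check", "fix", "report", "baseline"):
        # Every flag writes its own name into the shared `mode` attribute.
        mode.add_argument(f"--{flag}", dest="mode", action="store_const", const=flag)
    parser.set_defaults(mode="report")  # report is the default mode
    parser.add_argument("paths", nargs="*", help=".bib files to process (assumed)")
    return parser


def exit_status(mode: str, new_errors: int) -> int:
    """Only --check turns new errors into a failing exit status."""
    return 1 if mode == "check" and new_errors else 0
```

With this layout, running with no flag behaves like --report, and combining two mode flags is rejected by argparse before any file is read.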
2. Pilot sweep: vol1 (715 entries) + CITATION.bib (1 entry)
Pass 16 parallel-agent sweep Batch A. One general-purpose agent
verified 25 flagged entries via DBLP, Crossref, arXiv, publisher
pages, OpenReview, and ACL Anthology. Operated under the ten-rule
anti-hallucination contract documented in §5:
- Every field traced to a source URL with verbatim quote
- HIGH confidence requires ≥2 independent-domain sources
- NOT_FOUND always acceptable over guessing
- DOI captured opportunistically for all verified entries
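The confidence rule can be sketched as counting distinct source domains among quoted sources. This is an illustration under assumptions: the function name classify is hypothetical, hostnames stand in for "independent domains", and mapping exactly one domain to MEDIUM is a guess, since the contract above only defines the HIGH threshold and the NOT_FOUND fallback.

```python
from urllib.parse import urlparse


def classify(sources: list[tuple[str, str]]) -> str:
    """Classify a verification from (source URL, verbatim quote) pairs."""
    domains = set()
    for url, quote in sources:
        host = urlparse(url).hostname
        # A source only counts if it has a resolvable host and a quote.
        if host and quote.strip():
            domains.add(host)
    if len(domains) >= 2:
        return "HIGH"
    if len(domains) == 1:
        return "MEDIUM"  # assumption: a single-domain match is MEDIUM
    return "NOT_FOUND"
```

Two URLs on the same host count once, so a DBLP record plus a second DBLP page would not reach HIGH in this sketch.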
Results: 24 VERIFIED (19 HIGH + 5 MEDIUM), 1 NOT_FOUND. DOIs
added to 6 entries (plus 1 canonical-DOI correction for
Rajbhandari2020, replacing the ACM 10.5555 placeholder with the
IEEE SC20 DOI). All 24 carry x-verified / x-verified-by /
x-verified-source markers for the audit trail.
Not applied: wolf2017we. Agent flagged this entry as likely
fabricated — the title claims a paper about Google Bigtable
authored by Thomas Wolf (Hugging Face), but the URL points to a
Hugging Face blog about datasets. Title, author, and URL do not
match each other. Flagged for human review, intentionally left
in the open-findings ledger.
3. Pre-commit hook integration
Repo-wide bib_lint --check hook added to .pre-commit-config.yaml,
runs after bibtex-tidy on every .bib file in the repo (not just
vol1/vol2). Uses a baseline allow-list at bib_lint_baseline.json
that grandfathers 226 pre-existing violations across all 19 .bib
files — only NEW violations block commits going forward.
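The grandfathering logic can be sketched as a set difference between the violations found now and those recorded in the baseline. The JSON layout below (a list of objects with file, key, and rule) and the rule identifiers are assumptions for illustration; the actual schema of bib_lint_baseline.json is not described here.

```python
import json

# (bib file, entry key, rule id); this fingerprint shape is an assumption.
Violation = tuple[str, str, str]


def load_baseline(raw_json: str) -> set[Violation]:
    """Parse the allow-list into a set of violation fingerprints."""
    return {(v["file"], v["key"], v["rule"]) for v in json.loads(raw_json)}


def new_violations(found: list[Violation], baseline: set[Violation]) -> list[Violation]:
    """Keep only violations that are not grandfathered by the baseline."""
    return [v for v in found if v not in baseline]


# Hypothetical example data.
baseline = load_baseline(
    '[{"file": "vol2.bib", "key": "smith2019old", "rule": "pages-format"}]'
)
found = [
    ("vol2.bib", "smith2019old", "pages-format"),  # grandfathered
    ("vol2.bib", "lee2024new", "doi-format"),      # new, would block the commit
]
blocking = new_violations(found, baseline)
```

A check run in this sketch would fail only when the resulting list is non-empty, so fixing an old violation never breaks the hook and adding a new one always does.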
bibtex-tidy scope was also broadened from quarto/contents to
the whole repo so paper bibs (mlsysim, tinytorch, interviews,
periodic-table) get formatted consistently.
Post-sweep state:
- vol1 bibliography-hygiene findings: 24 -> 1 (wolf2017we)
- CITATION.bib bibliography-hygiene findings: 1 -> 0
- Total remaining: 195 findings across vol2 + 6 paper .bib files,
scheduled for the parallel-agent fan-out (Batches B-G)
17 lines
606 B
BibTeX
@inproceedings{reddi2024mlsysbook,
  title = {MLSysBook.AI: Principles and Practices of Machine Learning Systems Engineering},
  author = {Reddi, Vijay Janapa},
  year = {2024},
  booktitle = {
    2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS)
  },
  publisher = {IEEE},
  pages = {41--42},
  doi = {10.1109/CODES-ISSS60120.2024.00015},
  url = {https://mlsysbook.org},
  note = {Available at: https://mlsysbook.org},
  x-verified = {2026-04-08},
  x-verified-by = {pass-16-bib-sweep},
  x-verified-source = {https://dblp.org/rec/conf/codesisss/Reddi24},
}