Wraps up the bib-verify sweep across vol1, vol2, and the paper sub-projects,
and corrects three citation issues introduced earlier in the branch:
- Restore tang20211bit (1-bit Adam, Tang et al. ICML 2021) in vol2 bib and
in collective_communication.qmd. The earlier sweep had renamed the cite
to li2022, a key that resolved to AlphaCode or 1-Bit LAMB, not 1-bit Adam.
- Restore micikevicius2018mixed in vol1 bib to point at "Mixed Precision
Training" (Micikevicius et al. ICLR 2018). The entry had been overwritten
with an unrelated OpenSeq2Seq paper while the cite key stayed the same.
- Drop the unused li2022 (AlphaCode) entry and the duplicate li2022 (1-Bit
LAMB) entry from vol2 bib.
Also remove eight same-paper duplicate entries that the sweep had left
behind (vol1: lawson1979, gholami2022, lange2009, ribeiro2016; vol2:
bursztein2024, rasley2020, sevilla2022, narayanan2019).
After this commit the bibs have zero duplicate keys and zero orphan
citations across both volumes and all five paper sub-projects.
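
The two invariants above (no duplicate keys, no orphan citations) can be
spot-checked with a few lines of Python. This is a minimal sketch, not the
check the sweep actually ran: the regexes, the function names, and the
assumption that citations appear in Quarto's `@key` form are all
illustrative, and cross-reference labels such as `@fig-...` would need to be
filtered out in real use.

```python
import re
from collections import Counter

# Assumption: entries look like "@type{key," and citations like "@key".
ENTRY_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
CITE_KEY = re.compile(r"(?<![\w@])@([A-Za-z][\w:.-]*\w)")


def duplicate_keys(bib_text):
    """Return every cite key defined more than once in a .bib file."""
    counts = Counter(ENTRY_KEY.findall(bib_text))
    return sorted(key for key, n in counts.items() if n > 1)


def orphan_citations(qmd_text, bib_text):
    """Return keys cited in the prose that no bib entry defines."""
    defined = set(ENTRY_KEY.findall(bib_text))
    cited = set(CITE_KEY.findall(qmd_text))
    return sorted(cited - defined)
```

Both functions return an empty list when the invariant holds, which makes
them easy to wire into an assertion or a CI step.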
Three deliverables in one commit because they are co-dependent:
1. bib_lint.py — repo-wide BibTeX parser, validator, and formatter
New tool at book/tools/bib_lint.py. A stateful parser (not regex-based)
that handles nested braces in titles, LaTeX accents (\'{e}, \^{o},
\"{o}), double-quoted vs. brace-delimited field values, biblatex
date-vs-year equivalence, and the long author lists common in
multi-author ML papers (Theano, Habitat, etc.).
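
To illustrate why a stateful scan is needed, here is a minimal sketch of
reading one field value while tracking brace depth. It is not the code in
bib_lint.py; `read_value` is a hypothetical name, and the sketch assumes the
usual BibTeX convention that braces are counted regardless of backslashes
and that a double quote only terminates a quoted value at brace depth zero.

```python
def read_value(text, pos):
    """Read one field value starting at text[pos].

    Returns (value_without_delimiters, index_after_closing_delimiter).
    """
    opener = text[pos]
    if opener not in '{"':
        raise ValueError("field value must start with '{' or a double quote")
    depth = 0
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            # A brace-delimited value ends when its own opener is closed.
            if opener == "{" and depth == 0:
                return text[pos + 1:i], i + 1
        elif ch == '"' and opener == '"' and depth == 0 and i > pos:
            # A quoted value ends at the first quote outside any braces.
            return text[pos + 1:i], i + 1
        i += 1
    raise ValueError("unterminated field value")
```

A single regex such as `\{(.*?)\}` would stop at the first closing brace
inside `{o}`, which is the failure mode the depth counter avoids.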
Enforces §5 Bibliography Hygiene rules:
- Required fields per entry type (publisher, journal, booktitle,
year, author, title, institution, school as applicable)
- Forbidden fields dropped (organization, address — per MIT
Press round 1 cleanup)
- Journal names spelled out (detect J. Mach. Learn. Res. patterns)
- Author list rules (no et al., no em-dash shorthand, warn on
initial-only first names)
- Pages format (require --, not single hyphen)
- DOI format (bare, no https:// prefix)
- x-verified ISO-8601 date validation
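
A few of the rules above are simple enough to sketch. The following is an
illustrative approximation only: `check_entry`, the regexes, and the message
strings are assumptions made for this example, and the real rule set in
bib_lint.py may be stricter or differ in detail. The DOI used in the usage
note is a placeholder.

```python
import re
from datetime import date

LONE_HYPHEN = re.compile(r"(?<!-)-(?!-)")            # "12-34" but not "12--34"
BARE_DOI = re.compile(r"^10\.\d{4,9}/\S+$")          # no URL prefix allowed
ABBREVIATED = re.compile(r"\b[A-Z][a-z]{0,5}\.")     # e.g. "J. Mach. Learn. Res."


def check_entry(fields):
    """Return a list of hygiene problems for one entry's field dict."""
    problems = []
    pages = fields.get("pages")
    if pages is not None and LONE_HYPHEN.search(pages):
        problems.append("pages: use '--' for ranges")
    doi = fields.get("doi")
    if doi is not None and not BARE_DOI.match(doi):
        problems.append("doi: store the bare DOI, no https:// prefix")
    if re.search(r"\bet al\b", fields.get("author", "")):
        problems.append("author: list every author, no 'et al.'")
    if ABBREVIATED.search(fields.get("journal", "")):
        problems.append("journal: spell out the journal name")
    stamp = fields.get("x-verified")
    if stamp is not None:
        try:
            date.fromisoformat(stamp)
        except ValueError:
            problems.append("x-verified: not a valid ISO-8601 date")
    return problems
```

For example, `check_entry({"pages": "123-130", "doi": "10.1000/xyz"})`
reports only the pages problem, since the DOI is already bare.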
Provides bib_lint.apply_fields() as the SAFE alternative to
regex-based field insertion. The parallel-agent sweep's apply
pipeline uses it to insert verified metadata without risk of
mangling titles that contain braces.
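
The idea behind brace-safe insertion can be sketched as follows. This is not
the signature or implementation of bib_lint.apply_fields(); `insert_field`
and `entry_close` are hypothetical helpers written for this note. The regex
is used only to locate the entry header, and the end of the entry is found
by counting braces, so braces inside titles cannot confuse the edit.

```python
import re


def entry_close(text, start):
    """Return the index of the brace that closes the entry starting at start."""
    depth = 0
    for i in range(text.index("{", start), len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces in entry")


def insert_field(text, key, name, value):
    """Append `name = {value}` as the last field of the entry with this key."""
    match = re.search(r"@\w+\s*\{\s*" + re.escape(key) + r"\s*,", text)
    if match is None:
        raise KeyError(key)
    close = entry_close(text, match.start())
    body = text[:close].rstrip()
    if not body.endswith(","):
        body += ","  # make sure the previous field is terminated
    return f"{body}\n  {name} = {{{value}}},\n{text[close:]}"
```

The sketch only appends; replacing an existing field would need the same
depth-aware scan to find where that field's value ends.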
CLI modes: --check (validate, exit non-zero on new errors),
--fix (rewrite to canonical form), --report (detailed violations,
default), --baseline (regenerate the grandfather allow-list).
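
One way to express four mutually exclusive modes with --report as the
default is an argparse group that writes to a shared destination. This is a
sketch of the pattern, not the argument parser in bib_lint.py; the `paths`
positional, the `mode` attribute name, and the help strings are assumptions.

```python
import argparse

MODES = {
    "check": "validate, exit non-zero on new errors",
    "fix": "rewrite to canonical form",
    "report": "print detailed violations (default)",
    "baseline": "regenerate the grandfather allow-list",
}


def build_parser():
    parser = argparse.ArgumentParser(prog="bib_lint")
    parser.add_argument("paths", nargs="*", help=".bib files to process")
    group = parser.add_mutually_exclusive_group()
    for name, text in MODES.items():
        # Every flag stores its own name into args.mode.
        group.add_argument(f"--{name}", dest="mode", action="store_const",
                           const=name, help=text)
    parser.set_defaults(mode="report")
    return parser
```

With this layout, passing two mode flags at once is rejected by argparse
itself, so the dispatch code only ever sees a single mode.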
2. Pilot sweep: vol1 (715 entries) + CITATION.bib (1 entry)
Batch A of the Pass 16 parallel-agent sweep. One general-purpose agent
verified 25 flagged entries via DBLP, Crossref, arXiv, publisher
pages, OpenReview, and ACL Anthology. Operated under the ten-rule
anti-hallucination contract documented in §5:
- Every field traced to a source URL with verbatim quote
- HIGH confidence requires ≥2 sources from independent domains
- NOT_FOUND always acceptable over guessing
- DOI captured opportunistically for all verified entries
Results: 24 VERIFIED (19 HIGH + 5 MEDIUM), 1 NOT_FOUND. DOIs
added to 6 entries (plus 1 canonical-DOI correction for
Rajbhandari2020, replacing the ACM 10.5555 placeholder with the
IEEE SC20 DOI). All 24 verified entries carry x-verified /
x-verified-by / x-verified-source markers for the audit trail.
Not applied: wolf2017we. Agent flagged this entry as likely
fabricated — the title claims a paper about Google Bigtable
authored by Thomas Wolf (Hugging Face), but the URL points to a
Hugging Face blog about datasets. Title, author, and URL do not
match each other. Flagged for human review, intentionally left
in the open-findings ledger.
3. Pre-commit hook integration
Repo-wide bib_lint --check hook added to .pre-commit-config.yaml. It
runs after bibtex-tidy on every .bib file in the repo (not just
vol1/vol2). Uses a baseline allow-list at bib_lint_baseline.json
that grandfathers 226 pre-existing violations across all 19 .bib
files — only NEW violations block commits going forward.
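
The grandfathering logic amounts to a set difference between the violations
found now and those recorded in the baseline. The sketch below assumes each
violation is a dict with `file`, `key`, and `rule` fields; the actual schema
of bib_lint_baseline.json is not described here, so treat the field names
and both function names as hypothetical.

```python
def new_violations(found, baseline):
    """Return violations that are not already grandfathered by the baseline."""
    allowed = {(v["file"], v["key"], v["rule"]) for v in baseline}
    return [v for v in found
            if (v["file"], v["key"], v["rule"]) not in allowed]


def check_exit_code(found, baseline):
    """Exit status for a --check style run: 1 only if something new appeared."""
    return 1 if new_violations(found, baseline) else 0
```

Keying on file, cite key, and rule (rather than line number) keeps the
allow-list stable when unrelated entries are added or reordered.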
bibtex-tidy scope was also broadened from quarto/contents to
the whole repo so paper bibs (mlsysim, tinytorch, interviews,
periodic-table) get formatted consistently.
Post-sweep state:
- vol1 bibliography-hygiene findings: 24 -> 1 (wolf2017we)
- CITATION.bib bibliography-hygiene findings: 1 -> 0
- Total remaining: 195 findings across vol2 + 6 paper .bib files,
scheduled for the parallel-agent fan-out (Batches B-G)