MLSysBook CLI (binder) — Architecture & Reference
The ./book/binder CLI is the single entry point for building, validating, and formatting the MLSysBook. All pre-commit hooks route through binder.
Architecture
cli/
├── main.py # CLI router, help system (Rich UI)
├── core/
│ ├── config.py # Quarto config management (symlinks, switching)
│ └── discovery.py # Chapter/file discovery, fuzzy matching
├── commands/
│ ├── build.py # Build HTML/PDF/EPUB (full book or chapters)
│ ├── preview.py # Live dev server with hot reload
│ ├── validate.py # All validation checks (check group)
│ ├── formatting.py # Auto-formatters (format group)
│ ├── reference_check.py # Bibliography verification (Crossref/DOI)
│ ├── bib.py # Bibliography management
│ ├── info.py # Stats, figures, concepts, headers, acronyms
│ ├── render.py # Plot rendering to PNG gallery
│ ├── newsletter.py # Newsletter drafts and publishing
│ ├── clean.py # Build artifact cleanup
│ ├── debug.py # Failing chapter/section finder
│ ├── doctor.py # Health check
│ └── maintenance.py # Image compression, repo health
├── formats/ # Format-specific handlers
└── utils/ # Shared utilities
Command Groups
Build & Preview
./binder build pdf --vol1 # Build Volume I PDF
./binder build html --vol2 # Build Volume II website
./binder build pdf intro # Single chapter PDF
./binder preview # Live dev server (full book)
./binder preview intro # Live dev server (single chapter)
Check (Validation)
All validation is in validate.py. Each check belongs to a group and has a scope.
./binder check <group> # Run all scopes in a group
./binder check <group> --scope <name> # Run one specific scope
./binder check all # Run everything
./binder check all --vol1 # Volume I only
Check Groups & Scopes
| Group | Scope | What it checks |
|---|---|---|
| refs | python-syntax |
Python code block syntax errors |
inline-python |
Inline {python} reference validity |
|
cross-refs |
@sec-, @fig-, @tbl- reference targets exist |
|
citations |
[@key] citation keys exist in .bib |
|
inline |
Inline ref patterns, scope analysis | |
self-ref |
Self-referential cross-references | |
| labels | duplicates |
Duplicate #fig-, #tbl-, #lst- labels |
orphans |
Defined labels never referenced | |
fig-labels |
Underscore in figure label IDs | |
| headers | ids |
All ## sections have {#sec-...} IDs |
| footnotes | placement |
Footnotes not in tables/captions |
integrity |
Every [^fn-...] reference has a definition |
|
cross-chapter |
Duplicate footnote IDs across chapters | |
| figures | captions |
Figures have fig-cap and fig-alt |
div-syntax |
Figures use div syntax (not markdown-image) | |
flow |
Figures placed near first reference | |
files |
Referenced image files exist on disk | |
| rendering | patterns |
LaTeX+Python rendering hazards |
python-echo |
Python blocks have echo: false |
|
indexes |
\index{} not inline with headings |
|
dropcaps |
Drop cap compatibility | |
parts |
Part key validation | |
heading-levels |
No skipped heading levels (##→####) | |
duplicate-words |
Repeated consecutive words | |
grid-tables |
Warn about grid tables (prefer pipe) | |
tables |
Table content validation | |
ascii |
Non-ASCII characters in prose | |
percent-spacing |
No space between value and % |
|
unit-spacing |
Space between number and unit (100 ms) |
|
binary-units |
Use GB/TB not GiB/TiB |
|
contractions |
No contractions in body prose | |
unblended-prose |
No split paragraphs | |
times-spacing |
Space after $\times$ before word |
|
times-product-spacing |
Space before $\times$ after inline code |
|
purpose-unnumbered |
Purpose sections have {.unnumbered} |
|
div-fences |
Malformed ::: / :::: fences |
|
| images | formats |
Image file format validation |
external |
No external image URLs | |
svg-xml |
SVG XML well-formedness | |
| json | syntax |
JSON file syntax |
| units | physics |
Pint unit consistency in mlsys/ |
| spelling | prose |
Spell check prose content |
tikz |
Spell check TikZ labels | |
| epub | hygiene |
Fast SVG/BibTeX source invariants (pre-commit, <1s) |
smoke |
Reader-compat smoke checks on built EPUBs — no Java required | |
epubcheck |
W3C epubcheck on built EPUBs under _build/epub-vol*/ (CI, ~7s per volume) |
|
| sources | citations |
Source citation verification |
| references | hallucinator |
Bibliography entry verification (Crossref/DOI) |
| content | tree |
Content tree structure |
Format (Auto-formatters)
./binder format blanks # Collapse extra blank lines
./binder format python # Format Python blocks (Black, 70 chars)
./binder format lists # Fix list spacing
./binder format divs # Fix div/callout spacing
./binder format tables # Format grid tables
./binder format prettify # Align pipe table columns
./binder format all # Run all formatters
Other Commands
./binder info stats --vol1 # Word counts, figure counts
./binder info figures --vol1 # Extract figure list
./binder info concepts --vol1 # Extract concept list
./binder bib sync --vol1 # Sync bibliography
./binder render plots --vol1 # Render matplotlib plots to PNG
./binder clean # Remove build artifacts
./binder doctor # Comprehensive health check
./binder debug pdf --vol1 # Find failing chapter in PDF build
Pre-commit Integration
Every pre-commit hook routes through ./book/binder. The pattern is:
- id: book-check-<name>
entry: ./book/binder check <group> --scope <scope>
No inline bash scripts. One CLI, one validation framework, consistent output.
EPUB Checks — Two Layers, One CLI Surface
EPUB validation exists in two layers that both sit inside the binder CLI. Developers and CI exercise identical code paths, so there is no drift between "what pre-commit runs" and "what CI enforces."
./binder check epub # all scopes (default)
./binder check epub --scope hygiene # layer 1 — source-level, pre-commit grade
./binder check epub --scope epubcheck # layer 2 — post-build W3C validator
./binder check epub --scope epubcheck \
--max-fatal 0 --max-errors 0 # tighten thresholds (CI-style)
./binder check epub --json # machine-readable output (tools, CI)
Implementation: book/cli/commands/validate.py (_run_epub_hygiene, _run_epubcheck native methods) and the shared primitives in book/cli/commands/_epub_checks.py.
Layer 1 — hygiene (source-level, fast)
Runs in <1s across all SVG and BibTeX files under book/quarto/contents/. No EPUB build required. Wired into the pre-commit hook book-epub-hygiene.
Catches the four source-level patterns that historically broke builds in April 2026 (issues #1014, #1052, #1148):
| Code | What it catches | Symptom at epubcheck time |
|---|---|---|
svg-c0 |
C0 control characters (U+00–U+1F except TAB/LF/CR) inside SVG aria-label="..." values. Produced by matplotlib rendering titles that contain raw-bytes representations. |
FATAL(RSC-016) — EPUB fails to load in Kindle and ClearView. |
svg-dupe-marker |
Duplicate <marker id="X"/> inside a single SVG's <defs>. Produced by copy-paste during figure editing. |
ERROR(RSC-005) — figure may not render. |
bib-url-escape-underscore |
\_ inside a url= or http-bearing doi= BibTeX field. BibTeX's LaTeX escape leaks through citeproc. |
ERROR(RSC-020) — Kindle's URL validator rejects the href. |
bib-url-escape-percent |
\% inside the same URL fields. |
ERROR(RSC-020). |
bib-url-raw-angle |
Raw < or > inside URL fields (legal in SICI DOIs but forbidden by strict URI syntax). |
ERROR(RSC-020). |
If one of these triggers during commit, the fix is almost always at source:
- SVG C0 chars — open the matplotlib script that produced the plot and find the
plt.title(...)/ax.set_xlabel(...)call with arepr(bytes)or similar. Replace with the decoded string. - Duplicate SVG markers — open the SVG in a text editor and delete the duplicate
<marker>element. The compact one-line form and the multi-line expanded form are two common duplicates; keep one. - BibTeX URL escapes — edit the
.bibentry: replace\_with_,\%with%,</>with%3C/%3E. These escapes are LaTeX-only; BibTeXurl = { ... }fields do not require them.
Layer 2 — smoke (built-EPUB, no Java)
Runs in <1s against every EPUB under _build/epub-vol*/. Catches reader-compatibility issues that epubcheck does not, because it enforces EPUB 3 spec conformance while readers enforce a stricter subset:
| Code | What it catches | Symptom |
|---|---|---|
smoke-css-custom-property-decl |
--var-name: value; declarations in packaged CSS. |
ClearView / Tolino (pre-2023 firmware) silently fail to load the EPUB. |
smoke-css-custom-property-use |
var(--x) references. |
Same readers render the default fallback, producing visual regressions. |
smoke-external-resource |
src="https://..." / <link href="https://..."> in any XHTML. |
EPUB readers do not fetch remote assets; image shows as broken-icon on every reader. |
Fix at source in every case — there is no legitimate reason a packaged EPUB should reference external assets, and CSS custom properties should be inlined (or post-processed away) for the packaged stylesheet.
Useful when Java is not installed locally and epubcheck is unavailable; the other two layers still provide real coverage.
Layer 3 — epubcheck (built-EPUB, full spec)
Runs the W3C epubcheck validator against every EPUB discovered under book/quarto/_build/epub-vol*/. Emits ValidationIssue records with file, line, column, RSC/OPF code, severity, and human-readable message. When running under GitHub Actions, also emits ::error file=...,line=...:: annotations so findings show up inline on PR diffs.
Requires the epubcheck Python package (pinned in book/tools/dependencies/requirements.txt) plus a Java runtime (JRE 8+). If neither python -m epubcheck nor the epubcheck system binary is available, the check emits an epubcheck-missing issue with install instructions rather than silently passing.
Thresholds (from the command line, environment variables, or the CI workflow). Two modes:
-
Flat thresholds (for local use or simple CI):
--max-fatal N— fail if FATAL count exceeds N (default: 0). Kindle and ClearView reject any EPUB with FATAL, so 0 is the correct production threshold.--max-errors N— fail if ERROR count exceeds N (default: unlimited). Tighten to 0 once the baseline is clean.
-
Baseline ratchet (preferred in CI):
--baseline PATH— fail only if per-volume counts increase over what is recorded atPATH(JSON). Doesn't cliff-fail during incremental cleanup.--update-baseline— rewrite the baseline file to the current counts. Run after a cleanup lands; commit the updated file in the same PR. This is how you lower the ceiling.- When
--baselineis supplied, the flat thresholds are ignored.
The canonical baseline lives at book/tools/audit/epubcheck-baseline.json. The CI workflow (book-validate-dev.yml) uses the ratchet by default — a PR cannot land if epubcheck counts regress past that file. Initial state (April 2026) is 0/0/0 for both volumes, so any new FATAL, ERROR, or WARNING blocks the PR.
Defense in depth
The three layers exist because none alone is sufficient:
- Hygiene alone is a whitelist of known patterns. It cannot catch a category of error epubcheck invents in a future version, nor renderer-emitted markup (bare
<br>,--in TikZ comments) that only appears after Quarto + post-process have run. - Epubcheck alone takes ~7 seconds per volume and requires a full EPUB build. Developers under time pressure will find ways around it.
The rendered-EPUB layer also gets belt-and-suspenders support from book/quarto/scripts/epub_postprocess.py, a Quarto post-render hook that sanitizes XHTML/SVG/OPF in the built EPUB (strips -- from HTML comments, closes bare <br> tags, strips C0 chars from SVG aria-labels, normalizes URL escapes, declares the mathml OPF property). So a regression caught at source by hygiene, missed there, rescued by post-process, and missed again is still caught by epubcheck in CI. Three independent nets.
EPUB — How This Integrates with the Book Workflow
Three layers of defense, each invoked automatically at the moment it adds value:
Author workflow (writing and committing)
- Write normally. No new constraints on QMD content. The EPUB checks only fire on SVG and BibTeX source — prose changes pass through untouched.
- Commit. The
book-epub-hygienepre-commit hook runs./binder check epub --scope hygieneif you touched SVG, BibTeX, or the check implementation. Fails in <1s if an SVG has duplicate<marker>ids or a bib URL has\_escapes. The failure message tells you to auto-repair with./binder check epub --scope hygiene --fix. - Build locally.
./binder build epub --vol1runs:- Hygiene preflight (fast-fail on source issues, ~100ms).
- Quarto render (~2 minutes).
- Post-render sanitizer (fixes renderer-emitted issues).
- Smoke + epubcheck post-flight (~7s) — verifies the final EPUB against the same CI baseline.
- Push. CI builds both volumes and re-runs
./binder check epub --scope epubcheck --baseline …. Any regression past the recorded counts blocks the PR.
Release workflow (preparing an EPUB for readers)
- Build:
./binder build epub --all(or per-volume). Post-flight already validates — if the final line is✓ smokeand✓ epubcheck, the file is ready. - Visual spot-check: open the file in Sigil or Calibre Editor. Structural checks don't guarantee aesthetic quality.
- Ship: upload the file from
book/quarto/_build/epub-vol*/.
When CI (or local post-flight) blocks you
The failure line names the volume, severity, and delta:
epubcheck: regression against baseline (2 FATAL, 15 ERROR total)
• vol1 FATAL: 2 > baseline 0 (+2)
Two dispositions:
- Real regression: fix the underlying issue.
./binder check epub helpmaps every error code to a source-level fix../binder check epub --scope hygiene --fixauto-repairs the four mechanical classes. - Accepted increase (e.g., a Quarto upgrade introduced a new warning class you're OK living with):
./binder check epub --scope epubcheck --baseline book/tools/audit/epubcheck-baseline.json --update-baseline— commit the updated JSON in the same PR so the audit trail is clear.
Escape hatches
Rare but real cases where you need to build anyway:
./binder build epub --vol1 --skip-hygiene— bypass pre-render hygiene (e.g., debugging a source-level issue that's triggering the check you're trying to investigate)../binder build epub --vol1 --skip-validate— bypass post-render validation (e.g., iterating on a known-broken build; no Java locally).
Both are loud about being bypassed — the skipped step prints a yellow warning so the bypass is never silent.
Onboarding
A new contributor runs ./binder doctor once. The EPUB section reports Java availability, epubcheck availability, existing EPUB artifacts, and source hygiene in one table, with install commands inline if anything is missing. No separate EPUB setup document to hunt for.
Script Delegation
Some commands still delegate to scripts under book/tools/scripts/ (spelling, image formats, table formatting, Python formatting). These will be migrated to native CLI modules over time. EPUB checks have already been migrated: the hygiene and epubcheck scopes live in book/cli/commands/_epub_checks.py as pure Python, not subprocess-to-script.