mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-07-16 06:07:17 -05:00

Vijay Janapa Reddi c8b887249c docs: generalize internal rule-file references in comments

Replace pointers to the private project rules/docs tree (relative .claude/rules
and .claude/docs paths) in code comments and docstrings with neutral phrasing
("the project prose style guide", etc.). Load-bearing runtime paths that the
tooling reads or writes are left intact.

2026-05-30 17:32:19 -04:00

Math-rendering audit tools

Three scripts for catching LaTeX leakage in rendered HTML and PDF output. Built during the April 2026 math-rendering audit (see the project prose style guide for the underlying conventions these tools enforce).

All scripts are designed to run from the repo root.

What each script does

| Script | Purpose |
| --- | --- |
| `audit_math_rendering.py` | Builds each chapter's HTML via `binder` and scans the rendered output for raw LaTeX leakage outside MathJax/code zones. |
| `audit_math_pdf.py` | Builds per-chapter PDFs, extracts text via `pdftotext`, renders pages to PNG via `pdftoppm` for visual spot-checking, and applies the same leak detector to the extracted text. |
| `audit_pdf_spot_check.py` | Scans the PDFs produced by `audit_math_pdf.py` for known fix sites (regex map maintained in the script) and emits a markdown manifest pointing to the exact page numbers / PNGs to inspect. |
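
To make "leakage outside MathJax/code zones" concrete, here is a minimal sketch of that kind of scan. It is not the detector shipped in `audit_math_rendering.py`: the function name `find_leaks`, the list of protected tags, and the leak patterns are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns, NOT the regexes used by the real audit scripts.
# Zones where raw LaTeX is expected and must not be reported.
PROTECTED_ZONES = re.compile(
    r"<(pre|code|script|mjx-container)\b[^>]*>.*?</\1>",
    re.DOTALL | re.IGNORECASE,
)
TAGS = re.compile(r"<[^>]+>")
# A few common commands plus unrendered inline $...$ spans.
# Note: the $...$ branch would also flag prose such as "$5 and $10".
LEAK = re.compile(r"\\(?:frac|times|text|mathbf|alpha|sum)\b|\$[^$\n]+\$")


def find_leaks(html: str) -> list[str]:
    """Return raw-LaTeX fragments that appear outside protected zones."""
    visible = PROTECTED_ZONES.sub(" ", html)  # drop code and math zones
    visible = TAGS.sub(" ", visible)          # drop remaining markup
    return [match.group(0) for match in LEAK.finditer(visible)]
```

A regex-based scan like this is cheap but approximate; an HTML parser would handle nested or malformed tags more reliably.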

Quick start

# HTML audit across the whole book (~10 minutes; Binder public API)
./book/binder check math --scope render-audit

# Targeted script-level audit, useful while developing the audit itself
python3 tools/audit/audit_math_rendering.py vol1/introduction vol2/inference

# Just re-scan an existing build without rebuilding
python3 tools/audit/audit_math_rendering.py --skip-build

# PDF audit (slower; needs LaTeX toolchain + poppler-utils)
python3 tools/audit/audit_math_pdf.py vol1/introduction
python3 tools/audit/audit_math_pdf.py --fixed   # only chapters from the April 2026 fix set

# Generate visual spot-check map for the saved PDFs
python3 tools/audit/audit_pdf_spot_check.py

Outputs

The scripts write their reports and artifacts to the repo root by default:

audit-math-report.json / audit-math-report.md — HTML audit results
audit-pdf-report.json / audit-pdf-report.md — PDF audit results
audit-pdf-spot-check.md — visual spot-check manifest
audit-pdf-output/<vol>/<chap>/{chap.pdf,pages/page-NNN.png} — saved PDFs and page images

These paths are gitignored (see top-level .gitignore); they are local artifacts intended for inspection, not commits.

Concurrency warning

Do not run multiple binder build invocations in parallel. They all mutate the shared book/quarto/_quarto.yml and will corrupt each other's state. The HTML and PDF auditors are both serial internally; just don't run them in two terminals at once.
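
If accidental parallel runs are a concern, one option is to guard each invocation with a lock file. The sketch below is not part of the audit tooling; `build_lock` and the lock path are hypothetical names, and a crashed process would leave a stale lock that has to be removed by hand.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def build_lock(lock_path: Path):
    """Hold an exclusive lock file for the duration of a build."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(
            f"another build appears to be running ({lock_path})"
        ) from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

Wrapping a build in `with build_lock(Path(".binder-build.lock")):` (path assumed) would make a second terminal fail fast instead of racing on `book/quarto/_quarto.yml`.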

Dependencies

Standard binder build dependencies (Quarto + project venv)
pdftotext and pdftoppm from poppler-utils (PDF audit only)
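
A quick preflight check for the poppler-utils binaries can save a failed PDF run. This is a sketch only; `missing_tools` is a hypothetical helper, not something the audit scripts are known to provide.

```python
import shutil

# Binaries the PDF audit needs, per the list above.
REQUIRED_TOOLS = ("pdftotext", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the required executables that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All PDF audit tools found.")
```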

Known false positive: code blocks in PDFs

pdftotext extracts code blocks verbatim, so any chapter that contains LaTeX-style pseudocode in a code block will produce "leaks" in the PDF text scan that are not actual rendering bugs. Treat the PDF text scan as a soft signal; the rendered PNGs are the source of truth for PDF output.
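
Because the PNGs are authoritative, it helps to turn text-scan hits into a list of page images to open. The sketch below relies on `pdftotext` separating pages with form-feed characters by default. The name `pages_to_inspect`, the catch-all leak pattern, and the three-digit zero padding (inferred from `page-NNN.png`) are assumptions, and this is not how `audit_pdf_spot_check.py` is implemented.

```python
import re

# Hypothetical catch-all pattern: any backslash followed by letters.
LEAK = re.compile(r"\\[A-Za-z]+")


def pages_to_inspect(extracted_text: str) -> dict[str, list[str]]:
    """Map page-image paths to the suspicious fragments found on that page."""
    hits: dict[str, list[str]] = {}
    # pdftotext emits a form feed ("\f") at each page break by default.
    for number, page in enumerate(extracted_text.split("\f"), start=1):
        found = LEAK.findall(page)
        if found:
            # Assumed naming: zero-padded to three digits, as in page-NNN.png.
            hits[f"pages/page-{number:03d}.png"] = found
    return hits
```

Each key is then a candidate for visual review; a hit that sits inside a rendered code block on the PNG can be dismissed as a false positive.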