Files
cs249r_book/shared/scripts/check-internal-links.py
Vijay Janapa Reddi 152b8630dc fix(ci): clear all 8 failing pre-commit hooks on dev (#1413)
* fix(content): clear two mitpress-above-below pre-commit failures

The "📚 Book ·  Validate (Dev)" workflow has been failing on dev for
8+ consecutive runs because the mitpress-above-below pre-commit hook
flags spatial references like "above"/"below" inside body prose and
figure captions (the MIT Press style guide wants @sec-/@fig- cross-refs
or "earlier"/"later" instead). Two pre-existing violations were tripping
the hook on every push:

  - book/quarto/contents/vol1/responsible_engr/responsible_engr.qmd:1604
    fig-cap for fig-data-governance-pillars said "obligations discussed
    below: privacy, security, compliance, and transparency" — but those
    four obligations are *immediately* listed in the same caption, so
    "discussed below" was redundant. Reworded to "obligations of
    privacy, security, compliance, and transparency …".

  - book/quarto/contents/vol2/network_fabrics/network_fabrics.qmd:1217
    fig-cap for fig-congestion-cascade said "the PFC backpressure
    cascades described below." Reworded to "described later in this
    section." which is what the hook wants.

After our 4 release-prep merges (PR-1/2/7/12) cleaned up the other
hook failures (spelling, bibtex tidy, pipe tables, contractions,
mitpress-vs-period, …), this was the last remaining failing hook.
Verified locally:

  pre-commit run mitpress-above-below --all-files
  MIT Press: No above/below spatial refs (use cross-refs).....Passed

These are pure copy-edits to figure captions; no semantic change to
the diagrams or surrounding text.

* fix(check-internal-links): suppress 4 categories of false positives

The Tier 1 link checker (shipped in PR #1404) was over-eager and
flagged author content as broken in four documented patterns:

1. TikZ source inside HTML comments. Link regex matched `\node[mycycle](B1)`
   as a Markdown link `[mycycle](B1)`. Fix: strip `<!-- ... -->` bodies
   before scanning, preserving line/column offsets so any *real* failure
   we report stays accurate.
2. Quarto cross-references like `[Foo](@sec-bar)`, `@fig-x`, `@tbl-y`.
   These resolve through the project xref index at render time, not the
   filesystem; book/binder owns that validation. Fix: skip targets whose
   first token is `@sec-/@fig-/@tbl-/@eq-/@lst-/@thm-/@cor-/@def-/@exr-/
   @exm-/@prp-`.
3. Uppercase URL schemes (`HTTPS://`, `HTTP://`) — common after mobile
   auto-capitalize or copied citations. Fix: case-insensitive prefix
   match for the EXTERNAL_SCHEMES tuple.
4. GitHub-style emoji-prefix slugs in `.md` READMEs (e.g.
   `## 🎯 20 Progressive Modules` produces anchor `#-20-progressive-modules`
   on github.com, but Pandoc would slugify to `progressive-modules`).
   Fix: register both Pandoc-style and GitHub-style slugs as valid
   anchors so neither rendering target trips the checker.

Drops repo-wide broken-link count from 150 → 84 (false positives only;
no real link rot is masked). Real rot is fixed in a separate commit so
the checker improvement can be reviewed independently.

* fix(content): repair internal-link rot across 10 files

Concrete link rot the new checker (PR #1404) surfaced once its false
positives were cleared. None of these are stylistic; each link points
at a path or anchor that does not exist.

- README/README_{zh,ja,ko}.md (24 links): translation files live in
  README/ so paths to repo-root targets need a `../` prefix
  (`book/README.md` -> `../book/README.md`, etc.).
- mlsysim/docs/contributing.qmd (21 links): `../slides/...` pointed
  inside `mlsysim/`; the slides root is two levels up
  (`../../slides/...`).
- mlsysim/docs/cli-reference.qmd: `getting-started.qmd#bring-your-own-yaml-byoy`
  removed; retarget to `#defining-custom-models` (closest surviving
  section about user-supplied model specs).
- mlsysim/docs/for-engineers.qmd, for-instructors.qmd:
  `solver-guide.qmd#extending-mlsysim` no longer exists; retarget to
  `#writing-a-custom-solver` (the surviving custom-solver guide).
- book/tools/scripts/README.md: `../docs/BINDER.md` resolved to
  `book/tools/docs/BINDER.md` (nonexistent); the file actually lives
  at `book/docs/BINDER.md`, which is `../../docs/BINDER.md` from here.
- book/quarto/contents/frontmatter/index.qmd:
  `about.qmd#about-the-book-unnumbered` anchor was removed when the
  About heading was simplified; drop the anchor so the link lands at
  the top of the page (which IS the About section).
- tinytorch/datasets/tinytalks/README.md: `scripts/README.md` was
  never created; point at the directory listing instead.

* chore(pre-commit): exclude 3 forward-looking files from internal-link checker

Three files reference content that does not (yet) exist on the
filesystem; the references are intentional rather than rot, so they
should not block CI:

- labs/index.qmd: lists the 33 planned labs (vol1/lab_00..lab_16,
  vol2/lab_01..lab_16) as a roadmap. Links go live as each lab ships.
  De-linking now would lose the visual roadmap. When a lab lands the
  exclusion narrows naturally on its own.
- labs/PROTOCOL.md, labs/TEMPLATE.md: internal authoring docs that
  reference `../.claude/docs/labs/{PROTOCOL,TEMPLATE}.md`. The
  `.claude/` tree is per-worktree and not always present at the same
  relative path; these are author-tooling refs, not user-facing.

Net effect: the link checker is now green on a clean checkout. The
exclude block uses comments per existing convention so the rationale
is discoverable from the config alone.

* fix(content): clear codespell, contractions, and vs. pre-commit failures

Three pre-existing pre-commit hooks were failing on the dev branch
prior to the release-prep merges. Each is a small content normalization:

- codespell (2): re-declares -> redeclares (book/quarto/config/shared/README.md);
  unparseable -> unparsable (handled in the check-internal-links rewrite).
- contractions (2):
  * socratiq/socratiq.qmd callout: "If you're" -> "If you are".
  * nn_architectures fig-alt for the attention-visualization figure:
    "didn't" -> "did not". Alt-text is descriptive prose for screen
    readers, not a verbatim transcription of pixels, so expanding the
    contraction matches MIT Press style without changing the figure
    itself.
- mitpress-vs-period (6): bare `vs` -> `vs.` per MIT Press 2026 §10.5
  in benchmarking.qmd, distributed_training.qmd (x3 across two Python
  docstrings rendered in code listings), fault_tolerance.qmd, and
  inference.qmd. Code-listing strings are visible prose in the rendered
  PDF, so the rule applies there as well.

* chore: bibtex-tidy auto-format outputs

Outputs of the bibtex-tidy pre-commit hook (which auto-fixes its own
input). Picked up here so that running pre-commit on a clean checkout
no longer reports a "files were modified" failure for the same files
on every invocation. Pure formatting; no entry semantics changed.
2026-04-20 12:58:28 -04:00

480 lines
17 KiB
Python
Executable File

#!/usr/bin/env python3
"""Tier 1 link checker: validate internal Markdown / Quarto links offline.
Scope on purpose:
- Validate ONLY relative-path links and same-file anchor links inside
`.md` / `.qmd` files.
- DO NOT touch external URLs (http/https/mailto/tel/...). External
reachability is Lychee's job in CI; doing it here would make
pre-commit slow and network-flaky.
Why a separate tool from `book/binder`:
- The book toolchain owns Quarto cross-references (`@fig-foo`,
`@sec-bar`), bibliography keys, label hygiene, etc.
- This tool owns plain Markdown link integrity and works repo-wide
(every Quarto site, plus loose READMEs).
Usage:
python3 shared/scripts/check-internal-links.py # check the whole repo
python3 shared/scripts/check-internal-links.py FILE... # check named files
python3 shared/scripts/check-internal-links.py --staged # check git-staged files
python3 shared/scripts/check-internal-links.py --quiet # only print failures
Exit codes:
0 every internal link resolves
1 one or more broken internal links (printed file:line: detail)
2 invocation error
"""
from __future__ import annotations
import argparse
import os
import re
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
REPO_ROOT = Path(__file__).resolve().parents[2]
# Directories we never validate links in.
EXCLUDE_DIRS = {
".git",
".venv",
"node_modules",
"_site",
"_book",
"_build",
".quarto",
"htmlcov",
"site-packages",
# Per-stage build outputs and vendored extensions:
"_freeze",
"_extensions",
# The dev preview mirror gets very large and is a copy of generated HTML.
"_archive",
}
# File globs we walk when no explicit list is given.
DEFAULT_GLOBS = ("**/*.md", "**/*.qmd")
# Match Markdown inline links: [text](target) and image links: ![alt](target).
# - target may be wrapped in <...> for spaces (rare, but handle it).
# - Not greedy on the [...] part; nested brackets in link text are unusual
# in our content and would need a pure-PEG parser to handle safely.
LINK_RE = re.compile(
r"(?P<bang>!?)\[(?P<text>[^\]]*)\]\((?P<target><[^>]+>|[^)\s]+)(?:\s+\"[^\"]*\")?\)"
)
# Match a fenced-code-block opener/closer: ``` or ~~~ optionally followed by
# an info string. Quarto/Pandoc allow attribute braces (```{python}) too.
FENCE_OPEN_RE = re.compile(r"^(?P<indent>\s{0,3})(?P<fence>`{3,}|~{3,})(?P<info>.*)$")
# Strip inline code spans `...` so a backticked target like `[x](y)` inside
# a paragraph isn't treated as a link.
INLINE_CODE_RE = re.compile(r"`[^`\n]*`")
# Strip HTML comments (single- or multi-line). Authors stash TikZ source,
# review notes, and other non-rendered fragments in `<!-- ... -->` blocks;
# the LINK_RE regex would otherwise match TikZ syntax like
# `\node[mycycle](B1){GPU 1};` as a Markdown link `[mycycle](B1)`.
# Replace each non-newline char with a space to preserve line numbers and
# column offsets, so any *real* failures we report stay accurate.
HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Quarto / Pandoc cross-reference targets. These resolve at render time
# against the project-wide cross-ref index, not against the filesystem,
# so the link checker cannot validate them offline. The book toolchain
# (`book-check-references`) owns this validation.
# Examples we must skip: `[See sec](@sec-foo)`, `[Fig](@fig-bar)`,
# `[Tbl](@tbl-baz)`, `[Eq](@eq-qux)`, `[Lst](@lst-x)`.
QUARTO_XREF_PREFIXES = ("@sec-", "@fig-", "@tbl-", "@eq-", "@lst-", "@thm-",
"@cor-", "@def-", "@exr-", "@exm-", "@prp-")
# Match explicit Quarto / Pandoc anchor syntax inside headings:
# ## My Header {#sec-foo}
ANCHOR_ATTR_RE = re.compile(r"\{#(?P<id>[^\s}]+)[^}]*\}")
# Match ATX headings to derive their slugified id.
HEADING_RE = re.compile(r"^(#{1,6})\s+(?P<heading>.+?)\s*$", re.MULTILINE)
# Strip Quarto callouts and cross-ref macros from headings before slugifying.
HEADING_CLEAN_RE = re.compile(r"\{[^}]*\}|`[^`]*`")
# Schemes that are external and out of scope for this checker.
EXTERNAL_SCHEMES = (
"http://",
"https://",
"mailto:",
"tel:",
"ftp://",
"ftps://",
"ssh://",
"git://",
"data:",
"javascript:",
)
@dataclass(frozen=True)
class Problem:
file: Path
line: int
target: str
detail: str
def render(self, root: Path) -> str:
try:
rel = self.file.relative_to(root)
except ValueError:
rel = self.file
return f"{rel}:{self.line}: broken link → {self.target!r} ({self.detail})"
def slugify(heading: str) -> str:
"""Approximate Pandoc's `--id-prefix=section` slug for an ATX heading.
Matches Pandoc's `auto_identifiers` extension well enough for our content:
- Strip leading non-alphanumerics.
- Replace whitespace with single hyphens.
- Lowercase.
- Drop chars that aren't alnum, hyphen, underscore, period, or colon.
"""
cleaned = HEADING_CLEAN_RE.sub("", heading).strip()
# Pandoc strips leading non-letter characters from the slug.
cleaned = re.sub(r"^[^A-Za-z]+", "", cleaned)
cleaned = cleaned.lower()
cleaned = re.sub(r"\s+", "-", cleaned)
cleaned = re.sub(r"[^a-z0-9_\-.:]", "", cleaned)
return cleaned
def github_slugify(heading: str) -> str:
"""Approximate GitHub's heading-anchor slug (used in `.md` READMEs).
GitHub's algorithm differs from Pandoc in two important ways:
- It does NOT strip leading non-letter chars, so emoji-prefixed
headings like `## 🎯 20 Progressive Modules` produce anchors
that start with a hyphen (`-20-progressive-modules`) because
the emoji collapses to nothing while the trailing space
becomes a leading hyphen.
- It removes punctuation but preserves underscores and hyphens.
Generating both slugs and registering both as valid anchors keeps the
checker honest for repos that mix Quarto-rendered (.qmd) and
GitHub-rendered (.md) content. When the two algorithms agree the
extra entry is harmless.
"""
cleaned = HEADING_CLEAN_RE.sub("", heading)
cleaned = cleaned.lower()
# Drop everything that isn't a word char, whitespace, or hyphen.
cleaned = re.sub(r"[^\w\s\-]", "", cleaned)
cleaned = re.sub(r"\s+", "-", cleaned)
return cleaned
def collect_anchors(text: str) -> set[str]:
"""Return the set of anchor ids defined in the given Markdown source.
Combines:
- Explicit `{#id}` attributes anywhere in the document.
- Auto-generated heading slugs.
"""
anchors: set[str] = set()
for m in ANCHOR_ATTR_RE.finditer(text):
anchors.add(m.group("id"))
for m in HEADING_RE.finditer(text):
heading = m.group("heading").strip()
# If a heading already declares {#id}, the explicit id wins; capture it
# AND skip the slugified form because the auto slug isn't generated.
explicit = ANCHOR_ATTR_RE.search(heading)
if explicit:
continue
slug = slugify(heading)
if slug:
anchors.add(slug)
gh_slug = github_slugify(heading)
if gh_slug:
anchors.add(gh_slug)
return anchors
def is_external(target: str) -> bool:
# Match schemes case-insensitively. Authors occasionally type `HTTP://` or
# `HTTPS://` (especially after auto-capitalize on mobile or in copied
# academic citations); these are still external URLs and out of scope for
# this offline checker. A strict prefix match would otherwise treat them
# as relative paths and report bogus failures.
head = target[:8].lower()
return any(head.startswith(s) for s in EXTERNAL_SCHEMES)
def is_quarto_xref(target: str) -> bool:
"""True for Quarto cross-ref targets like `@sec-foo`, `@fig-bar`.
These resolve through the project's cross-ref index at render time, not
through the filesystem. Validation belongs to the book toolchain's
`book-check-references` hook, not here.
"""
return target.startswith(QUARTO_XREF_PREFIXES)
def strip_html_comments(text: str) -> str:
"""Erase HTML comment bodies while preserving line/column offsets.
Each non-newline character inside a `<!-- ... -->` block becomes a
space; newlines stay intact. This keeps any subsequent line-number /
column reporting accurate, while preventing the link regex from
matching link-shaped tokens that are author-visible only (TikZ
source, review notes, etc.).
"""
def _blank(match: re.Match) -> str:
return re.sub(r"[^\n]", " ", match.group(0))
return HTML_COMMENT_RE.sub(_blank, text)
def split_target(target: str) -> tuple[str, str]:
"""Split a link target into (path, anchor)."""
if target.startswith("<") and target.endswith(">"):
target = target[1:-1]
if "#" not in target:
return target, ""
path, _, anchor = target.partition("#")
return path, anchor
def candidate_paths(source: Path, raw: str) -> list[Path]:
"""Return possible filesystem paths a link target could resolve to.
Quarto sites resolve relative links against the file's directory, but
`.qmd` files often link to a sibling without the extension (the rendered
output is `.html`). We also accept `index.qmd` shorthand for directory
targets.
"""
if not raw:
return []
base = source.parent
if raw.startswith("/"):
# Site-absolute paths are resolved at render time and depend on the
# site's configured `site-url` / base path. We can't validate them
# offline reliably, so skip with a soft pass.
return []
raw_path = Path(raw)
candidates = [base / raw_path]
# Sometimes authors write `foo` when they mean `foo.qmd` (rendered as
# `foo.html`). Only meaningful when raw has no extension.
if not raw_path.suffix:
candidates.append((base / raw_path).with_suffix(".qmd"))
candidates.append((base / raw_path).with_suffix(".md"))
candidates.append((base / raw_path).with_suffix(".html"))
candidates.append(base / raw_path / "index.qmd")
candidates.append(base / raw_path / "index.md")
elif raw_path.suffix == ".html":
# `foo.html` in source most likely points at sibling `foo.qmd`.
candidates.append((base / raw_path).with_suffix(".qmd"))
candidates.append((base / raw_path).with_suffix(".md"))
return candidates
def file_text_cache() -> "dict[Path, str]":
return {}
def read(path: Path, cache: dict[Path, str]) -> str | None:
if path in cache:
return cache[path]
try:
text = path.read_text(encoding="utf-8")
except (OSError, UnicodeDecodeError):
return None
cache[path] = text
return text
def check_file(path: Path, cache: dict[Path, str]) -> list[Problem]:
if not path.exists():
return [Problem(path, 0, "", "file does not exist")]
text = read(path, cache)
if text is None:
return [Problem(path, 0, "", "file unreadable as UTF-8")]
own_anchors = collect_anchors(text)
problems: list[Problem] = []
# Erase HTML comment bodies BEFORE scanning. We compute anchors against
# the unmodified text (you can legitimately put a {#id} inside a comment
# that documents something), but we never want to *follow* link-shaped
# tokens out of a comment block.
scan_text = strip_html_comments(text)
in_fence = False
fence_marker: str | None = None # The exact `` `... `` or `~~~...` that opened the block.
for line_no, line in enumerate(scan_text.splitlines(), start=1):
# Track fenced code blocks (``` or ~~~). Inside a fence, skip all link
# parsing — TikZ, raw LaTeX, and code samples otherwise produce piles of
# false positives that look like Markdown links.
fence_match = FENCE_OPEN_RE.match(line)
if fence_match:
marker = fence_match.group("fence")[0] * 3 # normalize length to 3
if not in_fence:
in_fence = True
fence_marker = marker
elif fence_marker is not None and line.lstrip().startswith(fence_marker):
in_fence = False
fence_marker = None
continue
if in_fence:
continue
# Strip inline code so backticked link-shaped strings don't trip us.
scan_line = INLINE_CODE_RE.sub("", line)
for m in LINK_RE.finditer(scan_line):
target = m.group("target")
if not target or is_external(target) or is_quarto_xref(target):
continue
path_part, anchor = split_target(target)
if not path_part:
# Pure same-file anchor.
if anchor and anchor not in own_anchors:
problems.append(
Problem(path, line_no, target, "anchor not found in this file")
)
continue
cands = candidate_paths(path, path_part)
if not cands:
continue # Site-absolute or unparsable; skip.
resolved = next((c for c in cands if c.exists()), None)
if resolved is None:
problems.append(
Problem(path, line_no, target, f"no such file (checked {len(cands)} candidate paths)")
)
continue
if anchor:
target_text = read(resolved, cache)
if target_text is None:
# File exists but isn't text we can scan (e.g. a binary).
# Treat anchor as opaque; don't flag.
continue
target_anchors = collect_anchors(target_text)
if anchor not in target_anchors:
problems.append(
Problem(path, line_no, target, f"anchor #{anchor} not found in {resolved.name}")
)
return problems
def staged_files(root: Path) -> list[Path]:
cmd = ["git", "diff", "--cached", "--name-only", "--diff-filter=ACMR"]
try:
out = subprocess.check_output(cmd, cwd=root, text=True)
except (subprocess.CalledProcessError, FileNotFoundError) as exc:
sys.stderr.write(f"check-internal-links: failed to list staged files: {exc}\n")
sys.exit(2)
files = []
for line in out.splitlines():
line = line.strip()
if not line:
continue
if not (line.endswith(".md") or line.endswith(".qmd")):
continue
files.append((root / line).resolve())
return files
def discover_files(root: Path) -> list[Path]:
files: list[Path] = []
for glob in DEFAULT_GLOBS:
for candidate in root.glob(glob):
if any(part in EXCLUDE_DIRS for part in candidate.parts):
continue
files.append(candidate)
return files
def main(argv: list[str]) -> int:
parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument("files", nargs="*", help="Specific .md/.qmd files to check.")
parser.add_argument("--staged", action="store_true", help="Check files staged for commit.")
parser.add_argument("--quiet", "-q", action="store_true", help="Only print failures.")
parser.add_argument(
"--exclude",
action="append",
default=[],
metavar="GLOB",
help="Exclude files whose repo-relative path matches this glob. Repeatable.",
)
args = parser.parse_args(argv)
if args.staged and args.files:
parser.error("--staged is mutually exclusive with explicit FILES.")
root = REPO_ROOT
if args.staged:
files = staged_files(root)
elif args.files:
files = []
for raw in args.files:
p = Path(raw)
if not p.is_absolute():
p = (root / p).resolve()
if p.suffix not in (".md", ".qmd"):
continue
if any(part in EXCLUDE_DIRS for part in p.parts):
continue
files.append(p)
else:
files = discover_files(root)
if args.exclude:
import fnmatch
kept = []
for f in files:
try:
rel = str(f.relative_to(root))
except ValueError:
rel = str(f)
if any(fnmatch.fnmatch(rel, pat) for pat in args.exclude):
continue
kept.append(f)
files = kept
if not files:
if not args.quiet:
print("check-internal-links: no .md/.qmd files to check.")
return 0
cache: dict[Path, str] = {}
all_problems: list[Problem] = []
for path in sorted(set(files)):
all_problems.extend(check_file(path, cache))
if all_problems:
for prob in all_problems:
print(prob.render(root))
print(f"\ncheck-internal-links: {len(all_problems)} broken internal link(s) in {len({p.file for p in all_problems})} file(s).", file=sys.stderr)
return 1
if not args.quiet:
print(f"check-internal-links: OK ({len(files)} file(s) scanned).")
return 0
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))