3 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
69054ab9bc chore(links): allowlist CI-injected mlsysim-paper.pdf
The link checker fired on three references to mlsysim-paper.pdf in
mlsysim/docs/{whitepaper,for-instructors,index}.qmd. That PDF is
intentionally not committed — the mlsysim-publish-live workflow copies
pdf-artifacts/paper.pdf into mlsysim/docs/mlsysim-paper.pdf at deploy.

Add CI_INJECTED_BASENAMES with mlsysim-paper.pdf so the offline checker
skips it. Match by Path(target).name to handle ./ and ../ variants.
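
The basename match can be sketched roughly as follows. This is a minimal illustration, not the checker's actual code: only `CI_INJECTED_BASENAMES`, the `mlsysim-paper.pdf` entry, and the `Path(target).name` comparison come from the message above; the helper name `is_ci_injected` and the fragment/query stripping are assumptions.

```python
from pathlib import Path

# Files that a deploy workflow copies into place, so they are absent from a checkout.
CI_INJECTED_BASENAMES = frozenset({"mlsysim-paper.pdf"})


def is_ci_injected(target: str) -> bool:
    """Return True when a link target names a file that CI injects at deploy time."""
    # Assumption: drop any "#fragment" or "?query" suffix before comparing.
    path_part = target.split("#", 1)[0].split("?", 1)[0]
    # Comparing only the final path component treats "./x.pdf" and "../x.pdf" alike.
    return Path(path_part).name in CI_INJECTED_BASENAMES
```

Matching on the basename keeps the allowlist short, at the cost of also skipping any unrelated file that happens to share the name.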
2026-05-06 08:10:02 -04:00
Vijay Janapa Reddi
152b8630dc fix(ci): clear all 8 failing pre-commit hooks on dev (#1413)
* fix(content): clear two mitpress-above-below pre-commit failures

The "📚 Book ·  Validate (Dev)" workflow has been failing on dev for
8+ consecutive runs because the mitpress-above-below pre-commit hook
flags spatial references like "above"/"below" inside body prose and
figure captions (the MIT Press style guide wants @sec-/@fig- cross-refs
or "earlier"/"later" instead). Two pre-existing violations were tripping
the hook on every push:

  - book/quarto/contents/vol1/responsible_engr/responsible_engr.qmd:1604
    fig-cap for fig-data-governance-pillars said "obligations discussed
    below: privacy, security, compliance, and transparency" — but those
    four obligations are *immediately* listed in the same caption, so
    "discussed below" was redundant. Reworded to "obligations of
    privacy, security, compliance, and transparency …".

  - book/quarto/contents/vol2/network_fabrics/network_fabrics.qmd:1217
    fig-cap for fig-congestion-cascade said "the PFC backpressure
    cascades described below." Reworded to "described later in this
    section." which is what the hook wants.

After our 4 release-prep merges (PR-1/2/7/12) cleaned up the other
hook failures (spelling, bibtex tidy, pipe tables, contractions,
mitpress-vs-period, …), this was the last remaining failing hook.
Verified locally:

  pre-commit run mitpress-above-below --all-files
  MIT Press: No above/below spatial refs (use cross-refs).....Passed

These are pure copy-edits to figure captions; no semantic change to
the diagrams or surrounding text.

* fix(check-internal-links): suppress 4 categories of false positives

The Tier 1 link checker (shipped in PR #1404) was over-eager and
flagged author content as broken in four documented patterns:

1. TikZ source inside HTML comments. Link regex matched `\node[mycycle](B1)`
   as a Markdown link `[mycycle](B1)`. Fix: strip `<!-- ... -->` bodies
   before scanning, preserving line/column offsets so any *real* failure
   we report stays accurate.
2. Quarto cross-references like `[Foo](@sec-bar)`, `@fig-x`, `@tbl-y`.
   These resolve through the project xref index at render time, not the
   filesystem; book/binder owns that validation. Fix: skip targets whose
   first token is `@sec-/@fig-/@tbl-/@eq-/@lst-/@thm-/@cor-/@def-/@exr-/
   @exm-/@prp-`.
3. Uppercase URL schemes (`HTTPS://`, `HTTP://`) — common after mobile
   auto-capitalize or copied citations. Fix: case-insensitive prefix
   match for the EXTERNAL_SCHEMES tuple.
4. GitHub-style emoji-prefix slugs in `.md` READMEs (e.g.
   `## 🎯 20 Progressive Modules` produces anchor `#-20-progressive-modules`
   on github.com, but Pandoc would slugify to `progressive-modules`).
   Fix: register both Pandoc-style and GitHub-style slugs as valid
   anchors so neither rendering target trips the checker.
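
Fixes 1 to 3 can be sketched as three small helpers. This is an illustration under assumptions, not the shipped script: the function names and the contents of `EXTERNAL_SCHEMES` are hypothetical, and only the prefix list and the overall approach come from the list above.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Assumed contents; the real EXTERNAL_SCHEMES tuple may differ.
EXTERNAL_SCHEMES = ("http://", "https://", "mailto:")

XREF_PREFIXES = (
    "@sec-", "@fig-", "@tbl-", "@eq-", "@lst-", "@thm-",
    "@cor-", "@def-", "@exr-", "@exm-", "@prp-",
)


def blank_html_comments(text: str) -> str:
    """Replace comment bodies with spaces, keeping newlines so offsets stay valid."""
    def _blank(match: re.Match) -> str:
        return re.sub(r"[^\n]", " ", match.group(0))
    return HTML_COMMENT.sub(_blank, text)


def is_external(target: str) -> bool:
    """Case-insensitive scheme check, so HTTPS:// is treated like https://."""
    return target.lower().startswith(EXTERNAL_SCHEMES)


def is_quarto_xref(target: str) -> bool:
    """True when the first token of the target is a Quarto cross-reference."""
    tokens = target.split()
    return bool(tokens) and tokens[0].startswith(XREF_PREFIXES)
```

Blanking rather than deleting the comment body is what keeps reported line and column numbers aligned with the original file.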

Drops repo-wide broken-link count from 150 → 84 (false positives only;
no real link rot is masked). Real rot is fixed in a separate commit so
the checker improvement can be reviewed independently.
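
The dual-slug registration from fix 4 might look like the sketch below. Both slug functions are simplified approximations of the GitHub and Pandoc rules, written only to reproduce the example above; the function names are hypothetical and edge cases such as duplicate headings are ignored.

```python
import re


def github_slug(heading: str) -> str:
    """Approximate github.com: lowercase, drop punctuation and emoji, spaces to hyphens."""
    kept = re.sub(r"[^\w\- ]", "", heading.lower())
    return kept.replace(" ", "-")


def pandoc_slug(heading: str) -> str:
    """Approximate Pandoc: like the above, but also drop everything before the first letter."""
    kept = re.sub(r"[^\w\s.\-]", "", heading)
    hyphenated = re.sub(r"\s+", "-", kept).lower()
    for index, char in enumerate(hyphenated):
        if char.isalpha():
            return hyphenated[index:]
    return "section"  # Pandoc's fallback when nothing is left


def valid_anchors(heading: str) -> set[str]:
    """Accept either rendering target's slug as a valid anchor."""
    return {github_slug(heading), pandoc_slug(heading)}
```

For the heading text `🎯 20 Progressive Modules`, this yields both `-20-progressive-modules` and `progressive-modules`.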

* fix(content): repair internal-link rot across 10 files

Concrete link rot the new checker (PR #1404) surfaced once its false
positives were cleared. None of these are stylistic; each link points
at a path or anchor that does not exist.

- README/README_{zh,ja,ko}.md (24 links): translation files live in
  README/ so paths to repo-root targets need a `../` prefix
  (`book/README.md` -> `../book/README.md`, etc.).
- mlsysim/docs/contributing.qmd (21 links): `../slides/...` pointed
  inside `mlsysim/`; the slides root is two levels up
  (`../../slides/...`).
- mlsysim/docs/cli-reference.qmd: the anchor
  `getting-started.qmd#bring-your-own-yaml-byoy` was removed; retarget to
  `#defining-custom-models` (closest surviving section about
  user-supplied model specs).
- mlsysim/docs/for-engineers.qmd, for-instructors.qmd:
  `solver-guide.qmd#extending-mlsysim` no longer exists; retarget to
  `#writing-a-custom-solver` (the surviving custom-solver guide).
- book/tools/scripts/README.md: `../docs/BINDER.md` resolved to
  `book/tools/docs/BINDER.md` (nonexistent); the file actually lives
  at `book/docs/BINDER.md`, which is `../../docs/BINDER.md` from here.
- book/quarto/contents/frontmatter/index.qmd:
  `about.qmd#about-the-book-unnumbered` anchor was removed when the
  About heading was simplified; drop the anchor so the link lands at
  the top of the page (which IS the About section).
- tinytorch/datasets/tinytalks/README.md: `scripts/README.md` was
  never created; point at the directory listing instead.
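
Most of these are the same mistake: a relative target resolves against the directory of the file that contains the link, not the repo root. A minimal sketch of that resolution, using only paths quoted above (the helper name `resolve_link` is hypothetical):

```python
import posixpath


def resolve_link(source_file: str, target: str) -> str:
    """Resolve a relative link target against the directory of the linking file."""
    base_dir = posixpath.dirname(source_file)
    # normpath collapses the ".." segments so the result is a repo-relative path.
    return posixpath.normpath(posixpath.join(base_dir, target))


# The old target lands in a directory that does not exist.
old = resolve_link("book/tools/scripts/README.md", "../docs/BINDER.md")
# The corrected target climbs one more level.
new = resolve_link("book/tools/scripts/README.md", "../../docs/BINDER.md")
```

Here `old` is `book/tools/docs/BINDER.md` and `new` is `book/docs/BINDER.md`, matching the README fix described above.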

* chore(pre-commit): exclude 3 forward-looking files from internal-link checker

Three files reference content that does not (yet) exist on the
filesystem; the references are intentional rather than rot, so they
should not block CI:

- labs/index.qmd: lists the 33 planned labs (vol1/lab_00..lab_16,
  vol2/lab_01..lab_16) as a roadmap. Links go live as each lab ships.
  De-linking now would lose the visual roadmap. As each lab lands, the
  exclusion narrows on its own.
- labs/PROTOCOL.md, labs/TEMPLATE.md: internal authoring docs that
  reference `../.claude/docs/labs/{PROTOCOL,TEMPLATE}.md`. The
  `.claude/` tree is per-worktree and not always present at the same
  relative path; these are author-tooling refs, not user-facing.

Net effect: the link checker is now green on a clean checkout. The
exclude block uses comments per existing convention so the rationale
is discoverable from the config alone.

* fix(content): clear codespell, contractions, and vs. pre-commit failures

Three pre-existing pre-commit hooks were failing on the dev branch
prior to the release-prep merges. Each is a small content normalization:

- codespell (2): re-declares -> redeclares (book/quarto/config/shared/README.md);
  unparseable -> unparsable (handled in the check-internal-links rewrite).
- contractions (2):
  * socratiq/socratiq.qmd callout: "If you're" -> "If you are".
  * nn_architectures fig-alt for the attention-visualization figure:
    "didn't" -> "did not". Alt-text is descriptive prose for screen
    readers, not a verbatim transcription of pixels, so expanding the
    contraction matches MIT Press style without changing the figure
    itself.
- mitpress-vs-period (6): bare `vs` -> `vs.` per MIT Press 2026 §10.5
  in benchmarking.qmd, distributed_training.qmd (x3 across two Python
  docstrings rendered in code listings), fault_tolerance.qmd, and
  inference.qmd. Code-listing strings are visible prose in the rendered
  PDF, so the rule applies there as well.
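
For reference, a bare-`vs` check could be approximated as below. The hook's real pattern is not shown in this commit, so the regex and the helper name are assumptions made purely for illustration.

```python
import re

# Assumed pattern: the whole word "vs" that is not already followed by a period.
BARE_VS = re.compile(r"\bvs\b(?!\.)")


def find_bare_vs(line: str) -> list[int]:
    """Return the column offsets of every bare 'vs' on a line."""
    return [match.start() for match in BARE_VS.finditer(line)]
```

The word boundaries keep the pattern from firing inside longer words such as "canvas".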

* chore: bibtex-tidy auto-format outputs

Outputs of the bibtex-tidy pre-commit hook (which auto-fixes its own
input). Picked up here so that running pre-commit on a clean checkout
no longer reports a "files were modified" failure for the same files
on every invocation. Pure formatting; no entry semantics changed.
2026-04-20 12:58:28 -04:00
Vijay Janapa Reddi
456ecc85b2 PR-1: Release-prep safety net (link checking + publish guards + nightly link-rot) (#1404)
* ci(links): add Tier 1 pre-commit internal-link checker

Wire shared/scripts/check-internal-links.py into pre-commit to validate
relative-path markdown links and same-file anchors in changed .md/.qmd
files. External (http/https) URLs are deliberately out of scope here —
that belongs to Lychee in CI (Tier 2 per-site validate-dev, Tier 3
nightly rot scan).

The hook ignores fenced code blocks and inline code spans to avoid
false positives on TikZ syntax embedded in Quarto sources, and ships
with a baseline exclude list (auto-generated quartodoc API stubs,
legacy Sphinx 404s, GitHub line-range anchors) so it can land without
churn on existing content. Tighten the exclude list incrementally as
those areas are cleaned up.

Part of the staged-rollout safety net.

* ci(links): Tier 2 per-site Lychee validate-dev coverage

Generalize the reusable Lychee workflow and extend per-site validate-dev
coverage so every shippable property has external-link reachability as a
CI signal.

Reusable workflow (.github/workflows/infra-link-check.yml):
  - New inputs: lycheeignore_path, fail_on_broken (default false),
    accept_status. Resolves the ignore file at runtime and warns if
    missing rather than crashing the job.
  - Summary step now exits non-zero only when fail_on_broken is true,
    so it can be used as a non-blocking baseline today and tightened
    per site later.

Shared ignore file (shared/config/.lycheeignore):
  Universal patterns reused across sites (localhost, Google Slides
  behind auth, known transient 404s, the live preview targets we are
  about to publish to). The book keeps its existing canonical ignore
  at book/config/linting/.lycheeignore — do not duplicate.

Per-site validate-dev:
  - book, instructors, kits, labs, mlsysim, slides, tinytorch:
    add a check-links job calling the reusable workflow, scoped to
    that site's content tree and using the shared ignore file (book
    keeps its own). All wired with fail_on_broken=false initially so
    we discover the external-link baseline without blocking dev CI.
  - site, staffml: new validate-dev workflows so the unified landing
    page and StaffML have first-class CI parity (build + smoke + link
    check + summary), matching the cadence used by the other sites.
  - All summary steps updated to surface link-check results and to
    mark them explicitly as non-blocking until baselines are clean.

Part of the staged-rollout safety net (Tier 2 of the link-checking
strategy: pre-commit / per-site / nightly).

* ci(release): publish-live green gate + nightly link rot tracker

Two safety nets that close the loop on the staged-rollout plan: prevent
shipping from an unvalidated baseline, and keep a durable record of
external link rot across all sites.

Publish guard (.github/workflows/infra-publish-guard.yml):
  Reusable workflow called as the first job in every publish-live
  pipeline. Queries the GitHub API for the latest run of the matching
  validate-dev workflow on the dev branch and fails the publish if
  that run is not 'success' or is older than max_age_minutes (default
  24h). Inputs: validate_workflow (required), branch (default 'dev'),
  max_age_minutes (default 1440).
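
The gate itself reduces to two checks. The sketch below shows only that decision and assumes the caller has already fetched the latest run's conclusion and completion time; the function and parameter names other than `max_age_minutes` are hypothetical, and the actual workflow may implement this differently.

```python
from datetime import datetime, timedelta, timezone


def guard_allows_publish(
    conclusion: str,
    finished_at: datetime,
    now: datetime,
    max_age_minutes: int = 1440,
) -> bool:
    """Allow a publish only if the latest validate-dev run is green and recent."""
    if conclusion != "success":
        return False
    # A green run older than the window counts as an unvalidated baseline.
    return now - finished_at <= timedelta(minutes=max_age_minutes)


now = datetime(2026, 4, 20, 12, 0, tzinfo=timezone.utc)
fresh = guard_allows_publish("success", now - timedelta(hours=2), now)
stale = guard_allows_publish("success", now - timedelta(hours=25), now)
```

With the default window, `fresh` is `True` and `stale` is `False`.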

Wire-up: every *-publish-live.yml now starts with a `guard` job and
chains its existing first job's `needs` to depend on it.
  - book: guard runs only when confirm == 'PUBLISH' and not in
    testing_mode (matches the existing dispatch-guard pattern).
  - tinytorch: guard runs in addition to its in-band preflight (which
    re-runs validate-dev against the publish commit). Defense in depth
    on a workflow that already builds tags + PyPI artifacts.
  - kits, labs, instructors, mlsysim, slides, site, staffml: guard is
    the first job; the existing build-and-deploy / build job depends
    on it.

Nightly link-rot sweep (.github/workflows/infra-link-rot-nightly.yml):
  Runs at 04:30 UTC daily. Sweeps every site in parallel using the
  Tier 2 reusable workflow, then aggregates results into a single
  sticky GitHub issue (label: link-rot) so triage has one source of
  truth instead of dozens of opened/closed tickets. Each run rewrites
  the issue body with the current per-site status table and appends
  a count comment so trend over time stays visible.
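
As a rough picture of the aggregation step, the per-site table could be rendered like this. The column layout, the status wording, and the function name are all assumptions; the commit does not specify the issue body format.

```python
def render_status_table(broken_by_site: dict[str, int]) -> str:
    """Render a Markdown table with one row per site, sorted by site name."""
    lines = ["| Site | Broken links | Status |", "| --- | --- | --- |"]
    for site in sorted(broken_by_site):
        count = broken_by_site[site]
        status = "ok" if count == 0 else "broken"
        lines.append(f"| {site} | {count} | {status} |")
    return "\n".join(lines)
```

Regenerating the full table on each run is what lets a single issue body always reflect the latest sweep.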

Manual trigger supports a dry_run input that prints the report to the
job log without touching the issue.

Part of the staged-rollout safety net (Tier 3 + green-gate enforcement).

* fix(ci): drop --exclude-mail from Lychee args (removed in v0.21)

First real CI run on PR-1 surfaced this:

    error: unexpected argument '--exclude-mail' found
      tip: a similar argument exists: '--include-mail'

In lychee >= v0.21 the `--exclude-mail` flag was removed; mailto: links
are now skipped by default and the new opt-in flag is `--include-mail`.
The reusable infra-link-check.yml was still passing the old flag, so
lychee was crashing before checking any link. Every reusable
check-links job was reporting "success" anyway because:

  - the lychee step has `continue-on-error: true` (so a crash doesn't
    fail the job), and
  - every caller in this repo currently sets `fail_on_broken: false`
    (so the summary step also exits 0).

Net effect: link checking on PR-1 was a no-op. Fix is a one-arg
removal — skipping mail is the new default, which is what we want.

(Worth a separate follow-up: the summary step should distinguish
"lychee crashed" from "lychee found broken links" so that bad args
fail loudly even when fail_on_broken=false. Noted for later; not
blocking this PR.)
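
One possible shape for that follow-up, sketched as a pure function. Nothing here exists in the repo yet: the function name, its parameters, and the exit-code values are hypothetical, and how the summary step would detect a crash (for example, a missing report file) is left open.

```python
def summary_exit_code(lychee_ran_ok: bool, broken_links: int, fail_on_broken: bool) -> int:
    """Decide the summary step's exit code from the link-check outcome."""
    if not lychee_ran_ok:
        # A crash (such as a rejected CLI flag) fails regardless of fail_on_broken.
        return 2
    if broken_links > 0 and fail_on_broken:
        return 1
    # Broken links are reported but non-blocking while baselines are cleaned up.
    return 0
```

Under this scheme the `--exclude-mail` crash would have failed the job even with `fail_on_broken: false`.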
2026-04-20 09:05:59 -04:00