Files
cs249r_book/shared/scripts/find-duplicates.py
Vijay Janapa Reddi 8f09e80c4c PR-3: Scripts, audits, cleanup (build stamp, PDF dropdown, 404s, mirror guard, dedup, RELEASE-PREP) (#1406)
* feat(footer): build-time "last updated" stamp

Add a small build-time stamp to the page footer ("Last updated YYYY-MM-DD
· <site> · <commit>") so readers can see at a glance that the site is
fresh. Quarto's per-page `date-modified` already exists for chapter
pages, but it doesn't capture site-level rebuilds (theme tweaks,
navbar changes, deploy reruns).

Pieces:
  - shared/scripts/inject-build-stamp.sh: wraps a token-replace over a
    build directory. Search-and-replace on `<!-- MLSB_BUILD_STAMP -->`
    means sites that haven't adopted the token are unaffected — opt-in
    rollout per subsite.
  - book/quarto/config/shared/html/footer-common.yml: token added next
    to the existing copyright line in the shared book footer.
  - shared/config/footer-site.yml: token added next to the copyright
    in the unified-site footer.
  - shared/config/site-head.html: minimal CSS for `.mlsb-build-stamp`
    (small, neutral, dark-mode aware).
  - .github/workflows/kits-publish-live.yml: representative wiring —
    runs the stamp step after build and before deploy. Other publish-
    live workflows can adopt the step the same way as they roll
    through release-prep validation.

* feat(navbar): expose paper.pdf for TinyTorch / MLSys·im / StaffML

Each of these subsites already builds a companion paper.tex in CI and
ships the PDF alongside the HTML site. Surface those papers in the
navbar dropdowns where readers actually look for them:

  Build menu:
    - TinyTorch     → site
    - TinyTorch Paper (file-pdf icon, opens in new tab)
                    → /tinytorch/assets/downloads/TinyTorch-Paper.pdf
    - MLSys·im      → site
    - MLSys·im Paper (file-pdf icon, opens in new tab)
                    → /mlsysim/mlsysim-paper.pdf

  Prepare menu (after a separator):
    - StaffML Paper (file-pdf icon, opens in new tab)
                    → /staffml/downloads/StaffML-Paper.pdf

Paper URLs are intentionally kept in lockstep with the build steps in
tinytorch-publish-live (assets/downloads/), mlsysim-publish-live
(site root), and staffml-publish-live (out/downloads/). If a build
path moves, both the workflow and this navbar entry need to move
together — there is no single source.

* feat(404): per-site 404 pages for slides / instructors / unified site

The book, kits, labs, mlsysim, and tinytorch subsites already have
flavored 404.qmd pages that route lost readers to the right
neighborhood. Add the missing three so every subsite under
mlsysbook.ai has a coherent recovery experience instead of falling
back to GitHub Pages' default white-page 404.

  - slides/404.qmd       — slide-deck flavored copy, pointers back to
                           the deck index, the volumes, and the hub.
  - instructors/404.qmd  — instructor-flavored copy, pointers to the
                           course map, slides, and both volumes.
  - site/404.qmd         — landing-page flavored copy, the most
                           ecosystem-wide nav (links to every subsite)
                           because this is the most common 404 source
                           for inbound links from the legacy single-
                           volume mlsysbook.ai.

StaffML already has its own React not-found.tsx so no work needed.
TinyTorch's legacy Sphinx 404.md is preserved for now (still wired on
the Sphinx site that hasn't migrated yet).

* ci(precommit): block subsite-mirror drift on shared assets

Add a pre-commit hook that runs `shared/scripts/sync-mirrors.sh --check`
on every commit. The hook fails if any of the per-subsite real-file
mirrors (subscribe-modal.js, theme SCSS partials, logo) has drifted
from its canonical source in `shared/`.

Why a guard, not just a sync: Quarto's resource-copy step preserves
symlinks instead of dereferencing them, so we have to keep real
copies. Without the guard, "I'll edit the canonical and forget to
re-sync" silently re-introduces the duplicate-divergence bug we just
spent effort fixing. `always_run: true` because a mirror can drift via
deletion of the canonical, not just by editing the canonical itself.

To re-sync after a deliberate change:
  bash shared/scripts/sync-mirrors.sh

* refactor(audit): duplicate-file finder + clean up obvious leftover

Add shared/scripts/find-duplicates.py as a periodic duplication
auditor. It SHA-1 hashes every source-y file across the ecosystem
roots, groups identical contents, subtracts the intentional groups
declared in shared/scripts/sync-mirrors.sh, and reports the rest as
unintended duplicates. JSON report written to .audit/duplicates.json
for CI ingest later; --strict makes it exit non-zero.

Defaults err on the side of being useful out of the box:
  - Skips symlinks (those are deliberate aliases, not duplicates).
  - Skips small files (<256B) — LICENSE stubs, .gitkeep, etc.
  - Skips _site / _build / node_modules / .next / out / .git.
  - Source-y suffix list (.js, .ts, .scss, .css, .html, .yml, .py, .sh).
    Binary assets (images, PDFs) are NOT scanned because their dup
    story is different (logos, icons are intentionally repeated).

Initial-cleanup pass:
  - Delete tinytorch/scripts/cleanup_repo_history.sh — byte-identical
    leftover; the canonical version lives at
    tinytorch/tools/maintenance/cleanup_history.sh and is the one
    referenced by tinytorch/tools/maintenance/README.md.

After this commit the only remaining unintended duplicate is
runHistoryProvider.ts in three vscode-ext packages (kits / labs /
tinytorch). Promoting that into a shared vscode-ext package is real
refactor work — out of scope for release-prep, captured for later.

Add .audit/ and _audit/ (the latter from the Playwright site-audit
script) to .gitignore.

* docs(release-prep): handoff notes covering all five PR groupings

Add a single document at the worktree root that walks through what
this branch contains, why each piece is there, the recommended PR
split (PR-1 safety-net, PR-2 visual polish, PR-3 scripts/audits/
cleanup, PR-4 TinyTorch prep, PR-5 cutover skeletons), what was
intentionally LEFT OUT (and why), and what verification was done
locally vs. what still needs the dev mirror to exercise.

Treat this as the cover memo for the staged-rollout foundation
work; once the five PRs are individually merged into dev, this file
will outlive the branch but the per-PR sections still document why
each piece exists for anyone debugging months from now.
2026-04-19 16:23:26 -04:00

8.6 KiB
Executable File