[PR #1409] [MERGED] PR-5: Cutover skeletons (rollback-legacy + redirect map + sitemap aggregator) #6529

Closed
opened 2026-04-21 22:23:47 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/harvard-edge/cs249r_book/pull/1409
Author: @profvjreddi
Created: 4/19/2026
Status: Merged
Merged: 4/19/2026
Merged by: @profvjreddi

Base: dev ← Head: release-prep/cutover-skeletons


📝 Commits (3)

  • d71dd6f feat(launch): rollback-legacy.sh — snapshot + restore the gh-pages root
  • 01dbce4 feat(seo): redirect-map skeleton + HTML-stub generator
  • cc9a19c feat(seo): aggregate per-subsite sitemaps into mlsysbook.ai/sitemap.xml

📊 Changes

5 files changed (+834 additions, -0 deletions)


.github/workflows/infra-build-sitemap.yml (+155 -0)
shared/config/redirect-map.json (+70 -0)
shared/scripts/build-redirects.py (+204 -0)
shared/scripts/build-sitemap.py (+172 -0)
shared/scripts/rollback-legacy.sh (+233 -0)

📄 Description

Summary

Three thin scripts/configs that the actual cutover (legacy
mlsysbook.ai → unified landing) will rely on. Skeletons only — no
behavior change yet, no workflow that runs them in CI. Lets us
review the shape now and wire up the runners as part of the
launch sequence rather than the prep sequence.

What's in this PR

1. shared/scripts/rollback-legacy.sh — gh-pages root snapshot/restore.

The cutover replaces the gh-pages root content (currently the legacy
single-volume book) with the new unified landing. If something breaks
post-cutover, we need to be able to restore the legacy site fast.
Script supports two operations:

  • snapshot — clones gh-pages, archives the current root content
    (everything except subsite directories) to a timestamped tarball
    under shared/_snapshots/. Run this BEFORE the cutover.
  • restore <tarball> — clones gh-pages, replaces the root content
    with the snapshot, force-pushes. Run this if cutover needs reverting.

Subsite directories (vol1/, vol2/, tinytorch/, kits/,
labs/, slides/, instructors/, mlsysim/, staffml/, site/,
assets/) are explicitly preserved by both operations — they have
their own publish workflows and aren't part of the legacy root.
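The snapshot selection described above can be sketched as follows. This is a minimal illustration in Python, not the contents of rollback-legacy.sh (which is a bash script). The helper names, the tarball file name format, and the exclusion of .git are assumptions; only the subsite directory list comes from the description.

```python
import tarfile
import time
from pathlib import Path

# Subsite directories named in the PR description; they are left out of the snapshot.
SUBSITE_DIRS = {
    "vol1", "vol2", "tinytorch", "kits", "labs", "slides",
    "instructors", "mlsysim", "staffml", "site", "assets",
}


def legacy_root_entries(checkout: Path) -> list[Path]:
    """Return the top-level entries treated as legacy root content (hypothetical helper)."""
    return sorted(
        p for p in checkout.iterdir()
        # Skipping .git is an assumption; the description only lists subsite directories.
        if p.name != ".git" and p.name not in SUBSITE_DIRS
    )


def snapshot(checkout: Path, out_dir: Path) -> Path:
    """Archive the legacy root content of a gh-pages checkout to a timestamped tarball."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    tarball = out_dir / f"legacy-root-{stamp}.tar.gz"  # file name format is an assumption
    with tarfile.open(tarball, "w:gz") as tar:
        for entry in legacy_root_entries(checkout):
            # Directories are added recursively; arcname keeps paths relative to the root.
            tar.add(entry, arcname=entry.name)
    return tarball
```

A restore would be the inverse under the same assumptions: delete everything `legacy_root_entries` returns from a fresh checkout, extract the tarball in its place, and push.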

2. shared/config/redirect-map.json + shared/scripts/build-redirects.py.

The legacy book has dozens of indexed deep-links into chapters that
need 301-style redirects to their Volume I equivalents (and a handful
that move to Volume II). Approach:

  • redirect-map.json is the source of truth. Each entry has
    from (legacy path), to (canonical new URL), reason (why
    the URL changed; documentation for future maintainers), and
    status (active, pending, or archive).
  • build-redirects.py reads the JSON, generates HTML stubs at the
    from paths with both a <meta http-equiv="refresh"> and a
    <link rel="canonical"> for each entry. GitHub Pages doesn't
    support real 301s, so HTML-stub redirect is the standard pattern.
  • The current map is a SKELETON with the most-linked legacy URLs
    populated. Will be expanded based on actual analytics referrer
    data before cutover.
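To make the map-to-stub flow concrete, here is a minimal sketch. It is not the code in build-redirects.py. The entry fields (from, to, reason, status) follow the description, but the top-level list layout, the example paths and URLs, the function names, and the rule that only active entries produce stubs are all assumptions.

```python
import html
import json
from pathlib import Path

# Hypothetical entries; the paths and URLs are illustrative, not taken from the real map.
EXAMPLE_MAP = json.loads("""
[
  {
    "from": "contents/example-chapter.html",
    "to": "https://mlsysbook.ai/vol1/example-chapter.html",
    "reason": "Chapter moved into Volume I",
    "status": "active"
  },
  {
    "from": "contents/undecided.html",
    "to": "https://mlsysbook.ai/vol2/undecided.html",
    "reason": "Destination not final yet",
    "status": "pending"
  }
]
""")


def render_stub(target: str) -> str:
    """Return an HTML stub with a canonical link and an immediate meta refresh."""
    safe = html.escape(target, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n'
        '<meta charset="utf-8">\n'
        "<title>Redirecting</title>\n"
        f'<link rel="canonical" href="{safe}">\n'
        f'<meta http-equiv="refresh" content="0; url={safe}">\n'
        "</head>\n<body>\n"
        f'<p>This page has moved to <a href="{safe}">{safe}</a>.</p>\n'
        "</body>\n</html>\n"
    )


def build_stubs(entries, out_root: Path, dry_run: bool = True) -> list[Path]:
    """Plan (and, unless dry_run, write) one stub per active entry."""
    planned = []
    for entry in entries:
        if entry["status"] != "active":  # assumption: pending/archive entries are skipped
            continue
        dest = out_root / entry["from"].lstrip("/")
        planned.append(dest)
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(render_stub(entry["to"]), encoding="utf-8")
    return planned
```

Calling `build_stubs(EXAMPLE_MAP, Path("_site"))` with the default `dry_run=True` only returns the paths that would be written, which mirrors the intent of the --dry-run flag mentioned in the test plan.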

3. shared/scripts/build-sitemap.py + .github/workflows/infra-build-sitemap.yml.

Quarto generates a sitemap.xml per subsite. Search engines and
LLM crawlers expect a single sitemap (or sitemap index) at
mlsysbook.ai/sitemap.xml. Approach:

  • build-sitemap.py walks gh-pages, finds all per-subsite
    sitemap.xml files, generates a sitemap index at the root
    that lists each one. Sitemap index format (xmlns
    http://www.sitemaps.org/schemas/sitemap/0.9) is the standard
    way to handle multi-property sites without merging URL lists
    (which would break the per-property lastmod timestamps).
  • infra-build-sitemap.yml runs the script on workflow_dispatch
    and on a daily cron. Skeleton — not yet referenced by any
    publish workflow's post-deploy step.
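A sitemap index of the kind described above could be generated roughly like this. It is a sketch under stated assumptions, not the code in build-sitemap.py: the function names are hypothetical, subsite sitemaps are assumed to sit exactly one directory below the root, and lastmod is taken from each file's modification time, which may differ from what the real script does.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
BASE_URL = "https://mlsysbook.ai"  # site root named in the PR description


def find_subsite_sitemaps(root: Path) -> list[Path]:
    """Find sitemap.xml files one level below the root (assumed layout)."""
    return sorted(root.glob("*/sitemap.xml"))


def build_index(root: Path) -> str:
    """Return a sitemap index document that lists each per-subsite sitemap."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns, no prefix
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for sitemap in find_subsite_sitemaps(root):
        node = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        loc = ET.SubElement(node, f"{{{SITEMAP_NS}}}loc")
        loc.text = f"{BASE_URL}/{sitemap.relative_to(root).as_posix()}"
        # Assumption: the file's mtime stands in for the subsite's last publish date.
        mtime = datetime.fromtimestamp(sitemap.stat().st_mtime, tz=timezone.utc)
        lastmod = ET.SubElement(node, f"{{{SITEMAP_NS}}}lastmod")
        lastmod.text = mtime.strftime("%Y-%m-%d")
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Because the index only points at the per-subsite files, each subsite's own URL list and lastmod values stay untouched, which is the reason the description gives for preferring an index over a merged sitemap.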

What this PR is NOT

  • Does NOT actually cut over the legacy site (separate launch
    PR/runbook).
  • Does NOT auto-run the rollback script (manual maintainer trigger
    by design — rollback is a deliberate decision).
  • Does NOT yet plug build-redirects.py into any deploy workflow
    (that happens in the launch PR, after the redirect map is filled
    in from referrer analytics).
  • Does NOT yet plug build-sitemap.py into any publish workflow
    (cron-only for now — switching it to post-deploy is a launch task).

Risk surface

  • Pure-additive: no existing workflow runs these scripts, no existing
    config references the new files. Worst case: dead code.
  • The skeleton scripts are linted (shellcheck clean for bash,
    python3 -m py_compile clean for Python).
  • Workflow file passes the fork-safety scanner (no
    vars.* / secrets.* on a pull_request trigger).

Test plan

  • CI: validate-dev workflows run clean (no impact expected).
  • Local: python3 shared/scripts/build-redirects.py --dry-run
    produces the expected stubs from the current skeleton map.
  • Local: python3 shared/scripts/build-sitemap.py --dry-run
    against a built _site/ produces a syntactically-valid
    sitemap index (validate with xmllint --noout).
  • Manual: dispatch infra-build-sitemap.yml once after merge to
    confirm the workflow starts (an early exit is OK; the head it is
    dispatched from has no gh-pages branch, and the workflow queries
    the actual gh-pages branch).

Followup (launch PR)

  • Fill in redirect-map.json from actual analytics referrer data.
  • Wire build-redirects.py into the site-publish-live.yml (or a
    dedicated infra-publish-redirects.yml) so stubs get re-generated
    on every site publish.
  • Switch infra-build-sitemap.yml from cron-only to also being
    triggered post-publish from each *-publish-live.yml so the
    index stays fresh.
  • Add a snapshot job to infra (or to site-publish-live.yml's
    pre-deploy gate) that runs rollback-legacy.sh snapshot before
    the cutover deploy.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-21 22:23:47 -05:00
Reference: github-starred/cs249r_book#6529