[PR #1409] [MERGED] PR-5: Cutover skeletons (rollback-legacy + redirect map + sitemap aggregator) #6529


GiteaMirror · 2026-04-21T22:23:47-05:00


📋 Pull Request Information

Original PR: https://github.com/harvard-edge/cs249r_book/pull/1409
Author: @profvjreddi
Created: 4/19/2026
Status: ✅ Merged
Merged: 4/19/2026
Merged by: @profvjreddi

Base: dev ← Head: release-prep/cutover-skeletons

📝 Commits (3)

d71dd6f feat(launch): rollback-legacy.sh — snapshot + restore the gh-pages root
01dbce4 feat(seo): redirect-map skeleton + HTML-stub generator
cc9a19c feat(seo): aggregate per-subsite sitemaps into mlsysbook.ai/sitemap.xml

📊 Changes

5 files changed (+834 additions, -0 deletions)


➕ .github/workflows/infra-build-sitemap.yml (+155 -0)
➕ shared/config/redirect-map.json (+70 -0)
➕ shared/scripts/build-redirects.py (+204 -0)
➕ shared/scripts/build-sitemap.py (+172 -0)
➕ shared/scripts/rollback-legacy.sh (+233 -0)

📄 Description

Summary

Three thin scripts/configs that the actual cutover (legacy
mlsysbook.ai → unified landing) will rely on. Skeletons only — no
behavior change yet, no workflow that runs them in CI. Lets us
review the shape now and wire up the runners as part of the
launch sequence rather than the prep sequence.

What's in this PR

1. shared/scripts/rollback-legacy.sh — gh-pages root snapshot/restore.

The cutover replaces the gh-pages root content (currently the legacy
single-volume book) with the new unified landing. If something breaks
post-cutover, we need to be able to restore the legacy site fast.
Script supports two operations:

- snapshot: clones gh-pages, archives the current root content
  (everything except subsite directories) to a timestamped tarball
  under shared/_snapshots/. Run this BEFORE the cutover.
- restore <tarball>: clones gh-pages, replaces the root content
  with the snapshot, force-pushes. Run this if the cutover needs reverting.

Subsite directories (vol1/, vol2/, tinytorch/, kits/,
labs/, slides/, instructors/, mlsysim/, staffml/, site/,
assets/) are explicitly preserved by both operations — they have
their own publish workflows and aren't part of the legacy root.
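
The snapshot selection rule above can be sketched as follows. This is a minimal Python rendition of the logic, not the contents of rollback-legacy.sh (which is a bash script and is not shown here). The tarball naming scheme and the function names are assumptions made for illustration, and the sketch works on an already-cloned gh-pages checkout rather than cloning it.

```python
import tarfile
import time
from pathlib import Path

# Subsite directories named in the PR description; they stay out of the snapshot.
SUBSITE_DIRS = {
    "vol1", "vol2", "tinytorch", "kits", "labs", "slides",
    "instructors", "mlsysim", "staffml", "site", "assets",
}


def legacy_root_entries(checkout: Path) -> list[Path]:
    """Top-level entries that belong to the legacy root (no subsites, no .git)."""
    return sorted(
        p for p in checkout.iterdir()
        if p.name != ".git" and not (p.is_dir() and p.name in SUBSITE_DIRS)
    )


def snapshot(checkout: Path, out_dir: Path) -> Path:
    """Archive the legacy root content of a gh-pages checkout to a timestamped tarball."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    tarball = out_dir / f"legacy-root-{stamp}.tar.gz"  # hypothetical naming scheme
    with tarfile.open(tarball, "w:gz") as tar:
        for entry in legacy_root_entries(checkout):
            # arcname keeps paths relative to the gh-pages root
            tar.add(entry, arcname=entry.name)
    return tarball
```

Keeping the subsite list in one constant means a restore step could reuse the same rule to decide which entries it is allowed to replace.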

2. shared/config/redirect-map.json + shared/scripts/build-redirects.py.

The legacy book has dozens of indexed deep-links into chapters that
need permanent (301-style) redirects to their Volume I equivalents
(and a handful that move to Volume II). Approach:

- redirect-map.json is the source of truth. Each entry has
  from (legacy path), to (canonical new URL), reason (why
  the URL changed; documentation for future maintainers), and
  status (active, pending, or archive).
- build-redirects.py reads the JSON and generates HTML stubs at the
  from paths, each with both a <meta http-equiv="refresh"> and a
  <link rel="canonical">. GitHub Pages doesn't
  support real 301s, so an HTML-stub redirect is the standard pattern.
- The current map is a SKELETON with the most-linked legacy URLs
  populated. It will be expanded based on actual analytics referrer
  data before cutover.
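
The entry shape and stub generation described above might look roughly like this. It is a sketch under stated assumptions, not the contents of build-redirects.py. The helper names are hypothetical, and so are the choices to emit stubs only for entries with status active and to write directory-style paths as index.html.

```python
import html
from pathlib import Path

ALLOWED_STATUS = {"active", "pending", "archive"}
REQUIRED_FIELDS = ("from", "to", "reason", "status")


def validate_entry(entry: dict) -> None:
    """Reject entries that lack a documented field or use an unknown status."""
    missing = [f for f in REQUIRED_FIELDS if f not in entry]
    if missing:
        raise ValueError(f"redirect entry missing fields: {missing}")
    if entry["status"] not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {entry['status']!r}")


def render_stub(target: str) -> str:
    """HTML stub carrying both the meta refresh and the canonical link."""
    url = html.escape(target, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n'
        '<meta charset="utf-8">\n'
        f'<meta http-equiv="refresh" content="0; url={url}">\n'
        f'<link rel="canonical" href="{url}">\n'
        "<title>Page moved</title>\n"
        "</head>\n<body>\n"
        f'<p>This page has moved to <a href="{url}">{url}</a>.</p>\n'
        "</body>\n</html>\n"
    )


def stub_path(out_root: Path, legacy_path: str) -> Path:
    """Map a legacy URL path to the file that must exist at that path."""
    rel = legacy_path.lstrip("/")
    if rel == "" or rel.endswith("/"):
        rel += "index.html"  # assumption: directory URLs are served from index.html
    return out_root / rel


def build_stubs(entries: list[dict], out_root: Path, dry_run: bool = False) -> list[Path]:
    """Validate every entry; write (or, on dry run, only list) stubs for active ones."""
    planned = []
    for entry in entries:
        validate_entry(entry)
        if entry["status"] != "active":  # assumption: pending/archive produce no stub
            continue
        dest = stub_path(out_root, entry["from"])
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(render_stub(entry["to"]), encoding="utf-8")
        planned.append(dest)
    return planned
```

Escaping the target with html.escape keeps query strings containing & valid inside the attribute values.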

3. shared/scripts/build-sitemap.py + .github/workflows/infra-build-sitemap.yml.

Quarto generates a sitemap.xml per subsite. Search engines and
LLM crawlers expect a single sitemap (or sitemap index) at
mlsysbook.ai/sitemap.xml. Approach:

- build-sitemap.py walks gh-pages, finds all per-subsite
  sitemap.xml files, and generates a sitemap index at the root
  that lists each one. The sitemap index format (xmlns
  http://www.sitemaps.org/schemas/sitemap/0.9) is the standard
  way to handle multi-property sites without merging URL lists
  (which would break the per-property lastmod timestamps).
- infra-build-sitemap.yml runs the script on workflow_dispatch
  and on a daily cron. It is a skeleton: no publish workflow's
  post-deploy step references it yet.
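
A sitemap index of the kind described above can be sketched with the standard library. This is an illustrative sketch, not the contents of build-sitemap.py. It assumes each subsite's sitemap.xml sits exactly one directory below the site root, and the function names and the base_url parameter are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def find_subsite_sitemaps(site_root: Path) -> list[Path]:
    """Per-subsite sitemaps, assumed to live one level below the root."""
    return sorted(p for p in site_root.glob("*/sitemap.xml") if p.is_file())


def build_index(site_root: Path, base_url: str) -> str:
    """Return a sitemap index document that points at each subsite sitemap."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns, no prefix
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for sitemap in find_subsite_sitemaps(site_root):
        node = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        loc = ET.SubElement(node, f"{{{SITEMAP_NS}}}loc")
        rel = sitemap.relative_to(site_root).as_posix()
        loc.text = f"{base_url.rstrip('/')}/{rel}"
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Because the index only references the per-subsite files, each subsite's own URL entries and lastmod values are left untouched.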

What this PR is NOT

- Does NOT actually cut over the legacy site (separate launch
  PR/runbook).
- Does NOT auto-run the rollback script (manual maintainer trigger
  by design; rollback is a deliberate decision).
- Does NOT yet plug build-redirects.py into any deploy workflow
  (that happens in the launch PR, after the redirect map is filled
  in from referrer analytics).
- Does NOT yet plug build-sitemap.py into any publish workflow
  (cron-only for now; switching it to post-deploy is a launch task).

Risk surface

- Pure-additive: no existing workflow runs these scripts, no existing
  config references the new files. Worst case: dead code.
- The skeleton scripts are linted (shellcheck clean for bash,
  python3 -m py_compile clean for Python).
- Workflow file passes the fork-safety scanner (no
  vars.* / secrets.* on a pull_request trigger).

Test plan

- [ ] CI: validate-dev workflows run clean (no impact expected).
- [ ] Local: python3 shared/scripts/build-redirects.py --dry-run
  produces the expected stubs from the current skeleton map.
- [ ] Local: python3 shared/scripts/build-sitemap.py --dry-run
  against a built _site/ produces a syntactically valid
  sitemap index (validate with xmllint --noout).
- [ ] Manual: dispatch infra-build-sitemap.yml once after merge to
  confirm the workflow starts (the run may exit early; the absence of
  a gh-pages branch on the head it is dispatched from is OK, since
  the workflow queries the actual gh-pages branch).

Followup (launch PR)

- Fill in redirect-map.json from actual analytics referrer data.
- Wire build-redirects.py into site-publish-live.yml (or a
  dedicated infra-publish-redirects.yml) so stubs get re-generated
  on every site publish.
- Switch infra-build-sitemap.yml from cron-only to also being
  triggered post-publish from each *-publish-live.yml so the
  index stays fresh.
- Add a snapshot job to infra (or to site-publish-live.yml's
  pre-deploy gate) that runs rollback-legacy.sh snapshot before
  the cutover deploy.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


GiteaMirror added the pull-request label 2026-04-21 22:23:47 -05:00

GiteaMirror closed this issue

2026-04-21 22:23:48 -05:00


Reference: github-starred/cs249r_book#6529