cs249r_book/shared/scripts/rollback-legacy.sh
Vijay Janapa Reddi 773e106c63 PR-5: Cutover skeletons (rollback-legacy + redirect map + sitemap aggregator) (#1409)
* feat(launch): rollback-legacy.sh — snapshot + restore the gh-pages root

Add the panic button for the mlsysbook.ai cutover. The staged-rollout
plan keeps the legacy single-volume site at the gh-pages root while
the new properties (Vol I, Vol II, TinyTorch, labs, kits, slides,
mlsysim, instructors, staffml, unified landing) get deployed into
subdirectories. Once everything is verified, the unified landing
page replaces the legacy root, and at exactly that moment we want a
one-command revert path that doesn't require remembering which
gh-pages SHA "the old root" lived at.

Three modes:
  snapshot          Take a timestamped backup of the legacy root files
                    (everything at gh-pages root that is NOT a known
                    subsite directory) and push to legacy-backup/<TS>/.
  restore <ID>      Copy a snapshot back to root, OVERWRITING current
                    root files but leaving subsite directories alone.
  list              List available snapshots.
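
The `<TS>` in `legacy-backup/<TS>/` is a UTC timestamp, which the script produces with `date -u +%Y%m%dT%H%M%SZ`. As a minimal sketch, the snippet below generates an ID in that format and checks that a user-supplied `<ID>` has the same shape before it is joined onto the `legacy-backup/` prefix. Both helper names are hypothetical; the script itself only checks that the snapshot directory exists.

```shell
# Sketch only: new_snapshot_id and is_valid_snapshot_id are hypothetical
# helpers, not part of rollback-legacy.sh.

# Produce an ID in the same format the script uses for legacy-backup/<TS>/.
new_snapshot_id() {
  date -u +%Y%m%dT%H%M%SZ
}

# Accept only IDs shaped like 20260419T202211Z, so a value such as "../book"
# can never be appended to the legacy-backup/ prefix.
is_valid_snapshot_id() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]T[0-9][0-9][0-9][0-9][0-9][0-9]Z)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

id="$(new_snapshot_id)"
if is_valid_snapshot_id "$id"; then
  echo "snapshot path: legacy-backup/$id/"
fi
```

A check like this would sit just before the "snapshot not found" test in restore mode, if stricter input handling is wanted.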

Design choices worth flagging:

1. Subsite-aware. The script hard-codes the list of top-level
   subsite directories (book/, tinytorch/, kits/, labs/, mlsysim/,
   slides/, instructors/, interviews/, staffml/, about/, community/,
   newsletter/) and excludes them from BOTH snapshot capture AND
   restore overwrites. Rolling back the legacy landing page should
   never wipe out actively-deployed properties.

2. Dry-run by default. Every destructive mode requires --apply. The
   default behavior prints what would happen, including a diff
   preview for restore. This is the same posture the existing
   sync-mirrors.sh / link-checker / publish-guard scripts take.

3. Snapshots are kept, not moved. Restoring a snapshot is itself a
   reversible commit on gh-pages; the snapshot directory is preserved
   so a "rollback the rollback" is one more command away.

4. Doesn't touch the working tree. Operates against a fresh shallow
   clone in mktemp, so it can be run from any clone of the repo
   (developer machine or a GitHub Actions runner) without dirtying
   anything local.
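
Design choices 1 and 2 can be illustrated with a short sketch. It assumes a shortened stand-in for the subsite list and two hypothetical helpers, `is_subsite` and `run`; the real script inlines the comparison loop and gates only the final `git push` on `--apply`.

```shell
# Sketch only: SUBSITES is a shortened stand-in for the script's list, and
# is_subsite / run are hypothetical helper names.
SUBSITES="book tinytorch kits labs legacy-backup .git"
APPLY="${APPLY:-0}"   # 0 = dry run (the default), 1 = execute

# Succeed when the given top-level entry is a protected subsite directory.
is_subsite() {
  for s in $SUBSITES; do
    [ "$1" = "$s" ] && return 0
  done
  return 1
}

# Print the command in dry-run mode; execute it only when APPLY=1.
run() {
  if [ "$APPLY" -eq 1 ]; then
    "$@"
  else
    echo "DRY RUN: $*"
  fi
}

# Example: decide what a restore would remove from a root listing.
for entry in index.html book assets labs CNAME; do
  if is_subsite "$entry"; then
    echo "keep   $entry"
  else
    run rm -rf "$entry"
  fi
done
```

With the default `APPLY=0` nothing is deleted: protected entries are reported as kept, and every other entry is printed as the command that would have run.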

Typical sequence on launch day is documented inline at the top of
the script. Two short commands wrap the whole rollout: snapshot
before deploy, restore-by-ID if anything looks wrong.

* feat(seo): redirect-map skeleton + HTML-stub generator

Add the cutover plumbing for legacy-URL → new-URL redirects so the
PageRank accumulated under the old single-volume mlsysbook.ai
structure flows into the new ecosystem URLs (`/book/vol1/`,
`/labs/`, `/about/`, etc.) as soon as the unified landing replaces
the legacy root.

Two artifacts:

1. `shared/config/redirect-map.json` — declarative source of truth.
   Schema:
     - `from`:   legacy path (must start with '/')
     - `to`:     destination URL or path (resolves against base_url)
     - `status`: 301 / 302 / 307 / 308 (default 301)
     - `note`:   optional human note
   A trailing-`*` wildcard is supported in `from` for whole-subtree
   moves like `/contents/labs/* → /labs/*`. The file ships
   intentionally small: just enough entries to demonstrate the
   patterns and seed the launch — populating the full inventory
   from the legacy mlsysbook.ai sitemap is a separate task.

2. `shared/scripts/build-redirects.py` — generator.
   For each entry it emits a tiny HTML stub at the legacy path
   containing:
     <meta http-equiv="refresh" content="0;url=<dest>">
     <link rel="canonical" href="<dest>">
     <meta name="robots" content="noindex,follow">
   That combo is the closest GitHub-Pages-friendly equivalent of a
   301: real users get redirected in <100ms; crawlers treat the
   canonical as authoritative and drop the legacy URL on recrawl;
   PageRank flows through. The script ALSO emits a Netlify-format
   `_redirects` file from the same map, so the day we move off
   GitHub Pages (Cloudflare Pages, Netlify, S3+CF) the same source
   of truth produces real 301s with no rewrite.

   `--check` mode validates the JSON without writing anything (CI
   hook). Wildcards skip stub emission (we'd have to walk the
   deployed tree to expand them) but are still emitted to the
   Netlify file where they work natively.
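
To make the stub format concrete, here is a minimal shell sketch of the two outputs for a single non-wildcard entry. It is not the implementation of `build-redirects.py`: the helper names, the `<from>/index.html` layout, and the absence of HTML escaping for the destination are all assumptions made for illustration.

```shell
# Sketch only: emit_stub / emit_redirect_line and the index.html layout are
# assumptions, not the behaviour of build-redirects.py.

# emit_stub <out_dir> <from_path> <dest_url>
# Writes <out_dir><from_path>/index.html containing the three tags.
# NOTE: dest is inserted verbatim; a real generator should HTML-escape it.
emit_stub() {
  out_dir="$1"; from="${2%/}"; dest="$3"
  mkdir -p "$out_dir$from"
  cat > "$out_dir$from/index.html" <<EOF
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="refresh" content="0;url=$dest">
<link rel="canonical" href="$dest">
<meta name="robots" content="noindex,follow">
<title>Redirecting</title>
</head>
<body><a href="$dest">Redirecting to $dest</a></body>
</html>
EOF
}

# emit_redirect_line <out_dir> <from_path> <dest> [status]
# Appends one "from to status" rule to a Netlify-style _redirects file.
emit_redirect_line() {
  printf '%s %s %s\n' "$2" "$3" "${4:-301}" >> "$1/_redirects"
}

out="$(mktemp -d)"
emit_stub "$out" "/contents/labs/" "https://mlsysbook.ai/labs/"
emit_redirect_line "$out" "/contents/labs/" "/labs/"
```

Both outputs are derived from the same `from`/`to` pair, which is the point of keeping a single declarative map.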

Wiring into a *-publish-live workflow is a one-liner step:

  python3 shared/scripts/build-redirects.py \
    --map shared/config/redirect-map.json --out gh-pages-staging/

It is intentionally NOT done in this commit: it should land alongside
the actual unified-landing deploy, when there is something for the
legacy URLs to redirect away from.

* feat(seo): aggregate per-subsite sitemaps into mlsysbook.ai/sitemap.xml

The new ecosystem has every subsite (Vol I, Vol II, TinyTorch, labs,
kits, slides, instructors, mlsysim, staffml, the unified landing)
emitting its own `<subsite>/sitemap.xml` because that's what Quarto
and Next produce automatically. Search engines, however, want a
single authoritative entry point per *domain*. Without an aggregated
index they end up either crawling the subsite sitemaps separately
(if they happen to discover them) or missing some entirely.

This commit adds the aggregator:

  shared/scripts/build-sitemap.py
    Walks a deployed gh-pages tree, discovers every sitemap.xml under
    it (skipping the root one, legacy-backup snapshots, _archive,
    _site, and the like), and writes a single sitemap-index.xml at
    `<root>/sitemap.xml` that points at each subsite's sitemap as a
    `<sitemap><loc>…</loc></sitemap>` entry. It also creates or
    appends to `<root>/robots.txt` so the index is surfaced to
    crawlers via the standard `Sitemap:` directive.

    Optional `--include-subsite` allowlist (repeatable) for staged
    rollouts where we want the index to advertise only the subsites
    that have been verified live, even if other ones happen to be
    deployed in the tree. Defaults to "everything found".

    `--check` does discovery without writing.

  .github/workflows/infra-build-sitemap.yml
    Reusable workflow (`workflow_call`) wrapping the script so any
    `*-publish-live` workflow can refresh the index as its final
    step. Also `workflow_dispatch`-able for manual rebuilds. Joins
    the existing `gh-pages-deploy` concurrency group so it never
    races a publish push.

    Uses sparse-checkout to grab just the script from `dev` (no need
    to clone the whole monorepo into the runner) and a full clone of
    `gh-pages` to do the work.
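
The discovery-and-index step described above can be sketched in shell. This is not the implementation of `build-sitemap.py`: the function name `build_sitemap_index`, the exact prune list, and the demo directory layout are assumptions, and the `--include-subsite` allowlist is left out.

```shell
# Sketch only: build_sitemap_index is a hypothetical name; this is not the
# implementation of build-sitemap.py.

# build_sitemap_index <site_root> <base_url>
build_sitemap_index() {
  root="${1%/}"; base="${2%/}"
  # Discover nested sitemap.xml files. Pruned directories are never entered,
  # and the root sitemap itself is excluded by path.
  found="$(cd "$root" && find . \
      \( -name legacy-backup -o -name _archive -o -name _site -o -name .git \) -prune \
      -o -type f -name sitemap.xml ! -path ./sitemap.xml -print \
      | sed 's|^\./||' | sort)"

  # Write the index to a temp file first, then move it into place.
  tmp="$root/sitemap.xml.tmp"
  {
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    printf '%s\n' "$found" | while IFS= read -r rel; do
      if [ -n "$rel" ]; then
        echo "  <sitemap><loc>$base/$rel</loc></sitemap>"
      fi
    done
    echo '</sitemapindex>'
  } > "$tmp"
  mv "$tmp" "$root/sitemap.xml"

  # Create robots.txt if needed and add the Sitemap: line only once.
  line="Sitemap: $base/sitemap.xml"
  touch "$root/robots.txt"
  grep -qxF "$line" "$root/robots.txt" || echo "$line" >> "$root/robots.txt"
}

# Demo against a throwaway tree (layout is illustrative).
site="$(mktemp -d)"
mkdir -p "$site/book/vol1" "$site/labs" "$site/legacy-backup/20260419T202211Z"
for f in sitemap.xml book/vol1/sitemap.xml labs/sitemap.xml \
         legacy-backup/20260419T202211Z/sitemap.xml; do
  echo '<urlset/>' > "$site/$f"
done
build_sitemap_index "$site" "https://mlsysbook.ai"
cat "$site/sitemap.xml"
```

Running the function a second time leaves `robots.txt` unchanged, since the `Sitemap:` line is appended only when it is not already present.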

Wiring into per-subsite publish workflows happens in a follow-up
commit alongside the actual launch — this PR is "skeletons", and
the per-publish trigger is best landed when each subsite's launch
PR ships.
2026-04-19 16:22:11 -04:00

234 lines
7.5 KiB
Bash
Executable File

#!/usr/bin/env bash
# =============================================================================
# rollback-legacy.sh — snapshot + restore the legacy site at gh-pages root
# =============================================================================
#
# Why this script exists
# ----------------------
# The mlsysbook.ai launch is a phased rollout. New properties (Vol I, Vol II,
# TinyTorch, labs, kits, slides, instructors, staffml, mlsysim, unified
# landing) are deployed into subdirectories under gh-pages, while the legacy
# single-volume site continues to live at the *root* of mlsysbook.ai. Once
# everything is verified, the unified landing page replaces the legacy root.
#
# This script is the panic button for that root cutover. It does NOT touch
# subsite directories — the safety story for those is "redeploy from main".
#
# Modes
# -----
#   snapshot          Take a timestamped backup of the legacy root files
#                     (everything at the gh-pages root that is NOT a known
#                     subsite directory) and push it to a `legacy-backup/<TS>/`
#                     path on the gh-pages branch.
#
#   restore <ID>      Copy a previously-snapshotted set back to the gh-pages
#                     root, OVERWRITING the current root files but leaving
#                     subsite directories alone.
#
#   list              List available snapshots on the gh-pages branch.
#
# Subsite preservation
# --------------------
# A "subsite" is any top-level directory we knowingly publish into. The list
# is hard-coded below (and should be kept in sync with book-publish-live.yml
# and other *-publish-live workflows). Snapshots intentionally exclude these
# so a rollback never wipes out actively-deployed properties.
#
# Safety
# ------
#   - Dry run is the default (there is no --dry-run flag). Pass --apply to
#     actually push.
#   - Always operates against a fresh `gh-pages` clone in a tempdir; never
#     mutates the working tree of the calling repo.
#   - Refuses to restore a snapshot that doesn't exist.
#   - On restore, the snapshot path is ALSO retained (we copy, not move) so a
#     rollback is itself reversible.
#
# Typical sequence on launch day
# ------------------------------
#   1. Run `./rollback-legacy.sh snapshot --apply` BEFORE deploying the
#      unified landing. Note the printed snapshot ID.
#   2. Deploy the unified landing the usual way (book-publish-live with
#      target=all).
#   3. Verify mlsysbook.ai. If anything looks wrong:
#        ./rollback-legacy.sh restore <SNAPSHOT_ID> --apply
#      Wait for gh-pages CDN to invalidate (~5 min on GitHub Pages).
#
# Requires: bash 4+, git, and standard Unix utilities (cp, find, mktemp, wc,
# tr). Run from any clone of the repo.
# =============================================================================
set -euo pipefail
# Subsite directories at the gh-pages root that should NEVER be included in
# legacy-snapshots and NEVER be touched on restore. Keep this list in sync
# with the deploy workflows.
SUBSITES=(
  book
  tinytorch
  kits
  labs
  mlsysim
  slides
  instructors
  interviews
  staffml
  about
  community
  newsletter
  legacy-backup   # don't snapshot snapshots
  .git            # never touch git internals
)
REPO_URL="${REPO_URL:-https://github.com/harvard-edge/cs249r_book.git}"
GH_PAGES_BRANCH="gh-pages"
DRY_RUN=1
MODE=""
SNAPSHOT_ID=""
usage() {
  # Unquoted heredoc delimiter so that $REPO_URL below is expanded; with
  # <<'EOF' the help text would print the literal string "$REPO_URL".
  cat <<EOF
Usage:
  rollback-legacy.sh snapshot [--apply]
      Take a snapshot of the current legacy root on gh-pages.
  rollback-legacy.sh restore <SNAPSHOT_ID> [--apply]
      Restore a previously-taken snapshot to the legacy root.
  rollback-legacy.sh list
      List available snapshots.
Flags:
  --apply       Actually push to gh-pages (default is dry-run).
  --repo URL    Override clone URL (default: $REPO_URL).
EOF
  exit 2
}
# --- Arg parse -----------------------------------------------------------------
[[ $# -eq 0 ]] && usage
MODE="$1"; shift || true
case "$MODE" in
  snapshot|list) ;;
  restore)
    [[ $# -ge 1 ]] || { echo "❌ restore requires a snapshot ID" >&2; usage; }
    SNAPSHOT_ID="$1"; shift
    ;;
  -h|--help) usage ;;
  *) echo "❌ Unknown mode: $MODE" >&2; usage ;;
esac
while [[ $# -gt 0 ]]; do
  case "$1" in
    --apply) DRY_RUN=0 ;;
    --repo)
      # Guard against a missing value: under `set -u` a bare "$2" would abort
      # with an unhelpful "unbound variable" error.
      [[ $# -ge 2 ]] || { echo "❌ --repo requires a URL" >&2; usage; }
      REPO_URL="$2"; shift
      ;;
    -h|--help) usage ;;
    *) echo "❌ Unknown flag: $1" >&2; usage ;;
  esac
  shift
done
# --- Workspace ----------------------------------------------------------------
WORKDIR="$(mktemp -d -t mlsb-rollback.XXXXXX)"
trap 'rm -rf "$WORKDIR"' EXIT
echo "📁 Workspace: $WORKDIR"
echo "🌐 Cloning $REPO_URL @ $GH_PAGES_BRANCH ..."
git clone --quiet --depth=1 --branch="$GH_PAGES_BRANCH" "$REPO_URL" "$WORKDIR/gh-pages"
cd "$WORKDIR/gh-pages"
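# Build one --exclude=<subsite> argument per protected top-level path.
# NOTE: exclude_args is currently unused; the snapshot and restore loops
# below compare each entry against SUBSITES directly.
exclude_args=()
for s in "${SUBSITES[@]}"; do
  exclude_args+=( "--exclude=$s" )
done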
# --- Mode dispatch ------------------------------------------------------------
case "$MODE" in
  list)
    echo "📜 Snapshots on $GH_PAGES_BRANCH:"
    if [[ -d legacy-backup ]]; then
      (cd legacy-backup && ls -1 | sort)
    else
      echo " (none: legacy-backup/ does not exist yet)"
    fi
    ;;
  snapshot)
    TS="$(date -u +%Y%m%dT%H%M%SZ)"
    DEST="legacy-backup/$TS"
    echo "📸 Creating snapshot $TS ..."
    mkdir -p "$DEST"
    # dotglob makes `*` match dotfiles such as .nojekyll; without it they
    # would be silently left out of the snapshot. `.git` is still skipped
    # because it is listed in SUBSITES.
    shopt -s dotglob nullglob
    # Subsite-aware copy: every top-level entry except known subsites.
    for entry in *; do
      skip=0
      for s in "${SUBSITES[@]}"; do
        [[ "$entry" == "$s" ]] && { skip=1; break; }
      done
      [[ $skip -eq 1 ]] && continue
      cp -R "$entry" "$DEST/"
    done
    shopt -u dotglob nullglob
    file_count=$(find "$DEST" -type f | wc -l | tr -d ' ')
    echo "✅ Snapshot prepared: $DEST ($file_count files)"
    if [[ $DRY_RUN -eq 1 ]]; then
      echo "🚧 DRY RUN: not pushing. Pass --apply to push."
      echo " Snapshot ID would be: $TS"
      exit 0
    fi
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git add "$DEST"
    git commit -m "chore(rollback): snapshot legacy root → $DEST"
    git push origin "$GH_PAGES_BRANCH"
    echo "🎯 Snapshot ID: $TS"
    echo " To restore: rollback-legacy.sh restore $TS --apply"
    ;;
  restore)
    SRC="legacy-backup/$SNAPSHOT_ID"
    if [[ ! -d "$SRC" ]]; then
      echo "❌ Snapshot $SNAPSHOT_ID not found at $SRC" >&2
      exit 1
    fi
    echo "♻️ Restoring snapshot $SNAPSHOT_ID to gh-pages root ..."
    # Remove root-level entries that AREN'T subsites (so the restore is clean
    # and stale legacy files don't survive). DON'T touch subsite directories.
    # `*` does not match dotfiles here, so root-level dotfiles stay in place
    # unless the snapshot overwrites them below.
    shopt -s extglob nullglob
    for entry in *; do
      skip=0
      for s in "${SUBSITES[@]}"; do
        [[ "$entry" == "$s" ]] && { skip=1; break; }
      done
      [[ $skip -eq 1 ]] && continue
      rm -rf "$entry"
    done
    shopt -u extglob nullglob
    # Copy the snapshot's contents via "$SRC"/. rather than "$SRC"/* so that
    # any dotfiles it holds are restored and an empty snapshot does not fail
    # on an unmatched glob.
    cp -R "$SRC"/. .
    file_count=$(git status --porcelain | wc -l | tr -d ' ')
    echo "✅ Restore prepared. $file_count file changes vs current gh-pages."
    if [[ $DRY_RUN -eq 1 ]]; then
      echo "🚧 DRY RUN: not pushing. Pass --apply to push."
      echo " Diff preview (first 40 lines):"
      # `|| true`: when head exits early, git may receive SIGPIPE, which
      # pipefail would otherwise turn into a failing dry run.
      git status --porcelain | head -40 || true
      exit 0
    fi
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git add -A
    git commit -m "revert(launch): restore legacy root from snapshot $SNAPSHOT_ID"
    git push origin "$GH_PAGES_BRANCH"
    echo "🎯 Restore pushed. CDN invalidation typically completes within ~5 minutes."
    ;;
esac