mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-05-07 18:18:42 -05:00)
* feat(launch): rollback-legacy.sh — snapshot + restore the gh-pages root
Add the panic button for the mlsysbook.ai cutover. The staged-rollout
plan keeps the legacy single-volume site at the gh-pages root while
the new properties (Vol I, Vol II, TinyTorch, labs, kits, slides,
mlsysim, instructors, staffml, unified landing) get deployed into
subdirectories. Once everything is verified, the unified landing
page replaces the legacy root — and at exactly that moment we want a
one-command revert path that doesn't require remembering which
gh-pages SHA "the old root" lived at.
Three modes:

  snapshot      Take a timestamped backup of the legacy root files
                (everything at gh-pages root that is NOT a known
                subsite directory) and push to legacy-backup/<TS>/.
  restore <ID>  Copy a snapshot back to root, OVERWRITING current
                root files but leaving subsite directories alone.
  list          List available snapshots.
Design choices worth flagging:
1. Subsite-aware. The script hard-codes the list of top-level
subsite directories (book/, tinytorch/, kits/, labs/, mlsysim/,
slides/, instructors/, interviews/, staffml/, about/, community/,
newsletter/) and excludes them from BOTH snapshot capture AND
restore overwrites. Rolling back the legacy landing page should
never wipe out actively-deployed properties.
2. Dry-run by default. Every destructive mode requires --apply. The
default behavior prints what would happen, including a diff
preview for restore. This is the same posture the existing
sync-mirrors.sh / link-checker / publish-guard scripts take.
3. Snapshots are kept, not moved. Restoring a snapshot is itself a
reversible commit on gh-pages; the snapshot directory is preserved
so a "rollback the rollback" is one more command away.
4. Doesn't touch the working tree. Operates against a fresh shallow
clone in mktemp, so it can be run from any clone of the repo
(developer machine or a GitHub Actions runner) without dirtying
anything local.
Typical sequence on launch day is documented inline at the top of
the script. Two short commands wrap the whole rollout: snapshot
before deploy, restore-by-ID if anything looks wrong.
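As a sketch, that wrap might look like the following (the script path,
flag order, and comments are assumptions; the inline docs in the script
are authoritative):

    # Before the cutover: preview, then take, a snapshot of the legacy root.
    ./shared/scripts/rollback-legacy.sh snapshot           # dry-run preview
    ./shared/scripts/rollback-legacy.sh snapshot --apply   # push legacy-backup/<TS>/

    # ...deploy the unified landing over the root...

    # If anything looks wrong: find the snapshot, restore it.
    ./shared/scripts/rollback-legacy.sh list
    ./shared/scripts/rollback-legacy.sh restore <ID> --apply   # <ID> from `list`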
* feat(seo): redirect-map skeleton + HTML-stub generator
Add the cutover plumbing for legacy-URL → new-URL redirects so the
PageRank accumulated under the old single-volume mlsysbook.ai
structure flows into the new ecosystem URLs (`/book/vol1/`,
`/labs/`, `/about/`, etc.) as soon as the unified landing replaces
the legacy root.
Two artifacts:
1. `shared/config/redirect-map.json` — declarative source of truth.
Schema:
- `from`: legacy path (must start with '/')
- `to`: destination URL or path (resolves against base_url)
- `status`: 301 / 302 / 307 / 308 (default 301)
- `note`: optional human note
A trailing-`*` wildcard is supported in `from` for whole-subtree
moves like `/contents/labs/* → /labs/*`. The file ships
intentionally small: just enough entries to demonstrate the
patterns and seed the launch (a minimal example appears at the
end of this message); populating the full inventory from the
legacy mlsysbook.ai sitemap is a separate task.
2. `shared/scripts/build-redirects.py` — generator.
For each entry it emits a tiny HTML stub at the legacy path
containing:
<meta http-equiv="refresh" content="0;url=<dest>">
<link rel="canonical" href="<dest>">
<meta name="robots" content="noindex,follow">
That combo is the closest GitHub-Pages-friendly equivalent of a
301: real users get redirected in <100ms; crawlers treat the
canonical as authoritative and drop the legacy URL on recrawl;
PageRank flows through. The script ALSO emits a Netlify-format
`_redirects` file from the same map, so the day we move off
GitHub Pages (Cloudflare Pages, Netlify, S3+CF) the same source
of truth produces real 301s with no rewrite.
`--check` mode validates the JSON without writing anything (CI
hook). Wildcards skip stub emission (we'd have to walk the
deployed tree to expand them) but are still emitted to the
Netlify file where they work natively.
Wiring into a *-publish-live workflow is a one-liner step
(`python3 shared/scripts/build-redirects.py --map shared/config/redirect-map.json
--out gh-pages-staging/`) but is intentionally
NOT done in this commit — it should land alongside the actual
unified-landing deploy, when there is something for the legacy
URLs to redirect away from.
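For reference, a minimal `redirect-map.json` in the shape described
above might look like this (assuming the map is a flat JSON array;
the entries are illustrative, not the shipped ones):

    [
      {
        "from": "/index.html",
        "to": "/",
        "status": 301,
        "note": "legacy single-volume landing -> unified landing"
      },
      {
        "from": "/contents/labs/*",
        "to": "/labs/*",
        "note": "whole-subtree move; status defaults to 301"
      }
    ]

The wildcard entry would skip stub emission and land only in the
Netlify `_redirects` output, presumably as something like
`/contents/labs/* /labs/:splat 301` (`:splat` is Netlify's native
wildcard destination; the exact emitted form is an assumption here).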
* feat(seo): aggregate per-subsite sitemaps into mlsysbook.ai/sitemap.xml
The new ecosystem has every subsite (Vol I, Vol II, TinyTorch, labs,
kits, slides, instructors, mlsysim, staffml, the unified landing)
emitting its own `<subsite>/sitemap.xml` because that's what Quarto
and Next produce automatically. Search engines, however, want a
single authoritative entry point per *domain*. Without an aggregated
index they end up either crawling the subsite sitemaps separately
(if they happen to discover them) or missing some entirely.
This commit adds the aggregator:
shared/scripts/build-sitemap.py
Walks a deployed gh-pages tree, discovers every sitemap.xml under
it (skipping the root one, legacy-backup snapshots, _archive,
_site, and the like), and writes a single sitemap-index.xml at
`<root>/sitemap.xml` that points at each subsite's sitemap as a
`<sitemap><loc>…</loc></sitemap>` entry. It also creates or
appends to `<root>/robots.txt` so the index is surfaced to
crawlers via the standard `Sitemap:` directive.
Optional `--include-subsite` allowlist (repeatable) for staged
rollouts where we want the index to advertise only the subsites
that have been verified live, even if other ones happen to be
deployed in the tree. Defaults to "everything found".
`--check` does discovery without writing.
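As a usage sketch (the staging path mirrors the redirects example
above; the subsite names are drawn from the known subsite list):

    # CI validation: discover and report, write nothing.
    python3 shared/scripts/build-sitemap.py --root gh-pages-staging/ --check

    # Staged launch: advertise only the verified-live subsites.
    python3 shared/scripts/build-sitemap.py \
        --root gh-pages-staging/ \
        --base-url https://mlsysbook.ai \
        --include-subsite book --include-subsite labs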
.github/workflows/infra-build-sitemap.yml
Reusable workflow (`workflow_call`) wrapping the script so any
`*-publish-live` workflow can refresh the index as its final
step. Also `workflow_dispatch`-able for manual rebuilds. Joins
the existing `gh-pages-deploy` concurrency group so it never
races a publish push.
Uses sparse-checkout to grab just the script from `dev` (no need
to clone the whole monorepo into the runner) and a full clone of
`gh-pages` to do the work.
Wiring into per-subsite publish workflows happens in a follow-up
commit alongside the actual launch — this PR is "skeletons", and
the per-publish trigger is best landed when each subsite's launch
PR ships.
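When it lands, that wiring would be roughly one job per publish
workflow (job and dependency names here are hypothetical; the
reusable workflow may define inputs of its own):

    jobs:
      # ...existing build/deploy jobs...
      refresh-sitemap:
        needs: deploy                # hypothetical publish-job name
        uses: ./.github/workflows/infra-build-sitemap.yml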
shared/scripts/build-sitemap.py (173 lines, 5.9 KiB, Python, executable file)
#!/usr/bin/env python3
"""Aggregate per-subsite sitemap.xml files into a single root-level
sitemap-index.xml at mlsysbook.ai/sitemap.xml.

Why aggregate instead of one-sitemap-per-subsite?
-------------------------------------------------
Each subsite (Vol I, Vol II, TinyTorch, labs, …) emits its own
`sitemap.xml` at `<subsite>/sitemap.xml` because Quarto/Next produce
those automatically. That works, but search engines need a single
authoritative entry point per *domain*. Sitemap *indexes* are the
correct primitive for this: one root file at
`https://mlsysbook.ai/sitemap.xml` that points to each subsite's
sitemap.xml as `<sitemap><loc>...</loc></sitemap>` entries.

Behavior
--------
This script takes a deployed gh-pages tree as input, finds every
sitemap.xml under it (one per subsite), and writes a single
`sitemap.xml` at the tree root containing a sitemap-index pointing to
every per-subsite sitemap. It also emits a `robots.txt` (or appends to
an existing one) that surfaces the index.

Excludes:
- the root sitemap-index itself (no recursion)
- `legacy-backup/**` (rollback snapshots are not crawl targets)
- `_archive/**`, `_drafts/**`, `_site/**` (build artifacts that
  occasionally leak into deploys)

Usage
-----
    build-sitemap.py --root path/to/gh-pages/tree \\
        --base-url https://mlsysbook.ai \\
        [--include-subsite vol1 --include-subsite vol2 ...]
        [--check]

--include-subsite  Optional allowlist. If passed, only sub-sitemaps under
                   these top-level paths will be aggregated. Default is
                   "every sitemap.xml found under root, minus exclusions".
"""

from __future__ import annotations

import argparse
import sys
from datetime import datetime, timezone
from pathlib import Path

SKIP_PATH_PARTS = {"legacy-backup", "_archive", "_drafts", "_site", ".git"}

INDEX_HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)
INDEX_FOOTER = "</sitemapindex>\n"


def discover_sitemaps(
    root: Path,
    include: list[str] | None,
) -> list[Path]:
    """Walk `root` and return every sitemap.xml that should be indexed,
    skipping the root one and known-excluded subtrees."""
    found: list[Path] = []
    for path in root.rglob("sitemap.xml"):
        # Never include the root-level sitemap (we're about to overwrite it)
        if path == root / "sitemap.xml":
            continue
        rel_parts = path.relative_to(root).parts
        # Skip excluded subtrees
        if any(part in SKIP_PATH_PARTS for part in rel_parts):
            continue
        # Apply allowlist if given (top-level path part must match)
        if include and rel_parts[0] not in set(include):
            continue
        found.append(path)
    found.sort()
    return found


def write_root_index(
    root: Path,
    base_url: str,
    sitemaps: list[Path],
) -> Path:
    """Write `<root>/sitemap.xml` as a sitemap-index pointing to each
    discovered per-subsite sitemap."""
    base = base_url.rstrip("/")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d")

    lines = [INDEX_HEADER]
    for sm in sitemaps:
        rel = sm.relative_to(root).as_posix()
        loc = f"{base}/{rel}"
        lines.append("  <sitemap>\n")
        lines.append(f"    <loc>{loc}</loc>\n")
        lines.append(f"    <lastmod>{now}</lastmod>\n")
        lines.append("  </sitemap>\n")
    lines.append(INDEX_FOOTER)

    target = root / "sitemap.xml"
    target.write_text("".join(lines), encoding="utf-8")
    return target


def update_robots_txt(root: Path, base_url: str) -> Path:
    """Ensure `<root>/robots.txt` exists and surfaces the sitemap-index."""
    robots = root / "robots.txt"
    sitemap_url = f"{base_url.rstrip('/')}/sitemap.xml"
    line = f"Sitemap: {sitemap_url}\n"
    if robots.exists():
        existing = robots.read_text(encoding="utf-8")
        if sitemap_url in existing:
            return robots
        # Append (preserving any User-agent directives already present)
        if not existing.endswith("\n"):
            existing += "\n"
        robots.write_text(existing + line, encoding="utf-8")
    else:
        robots.write_text("User-agent: *\nAllow: /\n\n" + line, encoding="utf-8")
    return robots


def main() -> int:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--root", required=True, help="Deployed gh-pages tree root")
    ap.add_argument(
        "--base-url",
        default="https://mlsysbook.ai",
        help="Public base URL (default: https://mlsysbook.ai)",
    )
    ap.add_argument(
        "--include-subsite",
        action="append",
        default=None,
        help="Allowlist a subsite by top-level dir name. May be repeated. "
        "If omitted, every discovered sitemap.xml is indexed.",
    )
    ap.add_argument(
        "--check",
        action="store_true",
        help="Discover and report sitemaps; do not write anything.",
    )
    args = ap.parse_args()

    root = Path(args.root)
    if not root.is_dir():
        print(f"❌ --root '{root}' is not a directory", file=sys.stderr)
        return 2

    sitemaps = discover_sitemaps(root, args.include_subsite)
    if not sitemaps:
        print("❌ No subsite sitemap.xml files found under root.", file=sys.stderr)
        print("   Subsites are expected to publish their own sitemap.xml at", file=sys.stderr)
        print("   <subsite>/sitemap.xml during build.", file=sys.stderr)
        return 1

    print(f"📚 Discovered {len(sitemaps)} subsite sitemap(s):")
    for sm in sitemaps:
        print(f"  - {sm.relative_to(root)}")

    if args.check:
        return 0

    index_path = write_root_index(root, args.base_url, sitemaps)
    robots_path = update_robots_txt(root, args.base_url)
    print(f"✅ Wrote {index_path.relative_to(root)} (sitemap index)")
    print(f"✅ Updated {robots_path.relative_to(root)} (Sitemap: directive)")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())