Before this change, the StaffML Next.js dev server fetched scenario and
details (including napkin_math) from the production Cloudflare Worker
even when contributors had local YAML edits — so changes weren't visible
without shipping. The opt-in static-fallback path existed but was wired
incorrectly: getStaticFullDetail used a Function-constructor dynamic
import of ../data/corpus.json, which Turbopack rewrote to a non-existent
/_next/static/data/corpus.json URL and 404'd at runtime.
Fix in three parts:
1. Loader (interviews/staffml/src/lib/corpus.ts): replace the broken
dynamic import with fetch('/data/corpus.json'). On failure, throw a
clear error pointing at `vault build --local`.
2. Build (interviews/vault-cli/src/vault_cli/commands/build.py): mirror
the generated corpus.json into interviews/staffml/public/data/ so
Next serves it as a static asset. Add --local as a clearer alias for
--local-json and update the help text to spell out the dev workflow.
3. Wiring (interviews/staffml/package.json + scripts/build-local-corpus.mjs):
predev now runs `vault build --local` automatically, with a soft-fail
path if the vault CLI isn't installed (so first-time contributors
still get a working dev server, just with the worker fallback). The
committed .env.development sets NEXT_PUBLIC_VAULT_FALLBACK=static so
the static path is the default in dev. Both copies of corpus.json are
gitignored as build artifacts (the YAMLs are the source of truth).
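A minimal sketch of the mirroring step in part 2, assuming the generated file is already on disk. The helper name `mirror_corpus` and its parameters are hypothetical and are not taken from build.py.

```python
import shutil
from pathlib import Path


def mirror_corpus(generated: Path, public_data_dir: Path) -> Path:
    """Copy the generated corpus.json into a directory served as static assets.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not generated.is_file():
        # Mirrors the loader's intent: point the contributor at the build step.
        raise FileNotFoundError(
            f"{generated} is missing; run `vault build --local` first"
        )
    public_data_dir.mkdir(parents=True, exist_ok=True)
    target = public_data_dir / generated.name
    shutil.copyfile(generated, target)
    return target
```

Copying rather than symlinking keeps the public/ copy usable by a static file server, at the cost of a second gitignored artifact.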
The bundled corpus.json was serving as a prod safety net behind the
Cloudflare Worker. Post-cutover the Worker has been the real data
source, and the static path was silently degrading rather than helping
(corpus.json is a generated artifact whose prose `details` are blank
in corpus-summary.json). This change:
- Stops emitting corpus.json in the publish-live workflow
- Removes the Worker-error fallback in getQuestionFullDetail — errors
now propagate to useFullQuestion and the UI shows a "details
unavailable" banner instead of silently filling blanks
- Drops the localhost auto-trigger in shouldUseStaticDetails — the
static path now requires explicit NEXT_PUBLIC_VAULT_FALLBACK=static
- Switches taxonomy.ts to corpus-summary.json (was corpus.json)
- Rewrites the publish-live smoke tests against corpus-summary.json
- Collapses validate-vault.py to sparse-only (per-question deep
validation lives in `vault check --strict`)
Static-fallback remains as an OPT-IN local-dev affordance: set
NEXT_PUBLIC_VAULT_FALLBACK=static and run `vault build --legacy-json`
to materialize corpus.json. The Function-constructor dynamic import
keeps Turbopack from requiring corpus.json at build time.
useFullQuestion hook signature changed from `Question | undefined` to
`{ question, status }`. Callers updated: practice and plans pages
(both render an amber "details unavailable" banner when status
is 'error').
Deleted dead cutover scaffolding: corpus-source.ts (router with no UI
consumers), corpus-vault.ts (worker-only mirror, never wired up),
useVaultQuestion.ts (unused migration hook), vault-fallback.ts (only
consumer was corpus-source.ts).
Deleted stale docs: staffml/scripts/DEPRECATED.md, vault-cli/docs/
CUTOVER_QA.md, three vault/docs/RESUME_PLAN_*.md.
Verified locally: tsc clean, vitest 37/37, next build produces all
15 static routes.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to fix(ci): staffml validate output paths for trailingSlash
builds. The E2E smoke test in interviews/staffml/scripts/e2e-smoke.py
hit /practice.html, /plans.html, /gauntlet.html, /about.html,
/progress.html — all of which return 404 with trailingSlash: true.
Switch ROUTES + ASSERT_CONTAINS keys to the canonical /<page>/ form
that Next.js actually serves.
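The route rewrite can be sketched as follows. `to_trailing_slash` and `LEGACY_ROUTES` are hypothetical names used only for illustration; the actual script may simply hard-code the new keys.

```python
def to_trailing_slash(route: str) -> str:
    """Map a legacy '/page.html' route to the '/page/' form."""
    if route.endswith("/"):
        return route  # already canonical (includes the root route)
    if route.endswith(".html"):
        route = route[: -len(".html")]
    return route + "/"


# The five routes named above that returned 404 under trailingSlash: true.
LEGACY_ROUTES = [
    "/practice.html",
    "/plans.html",
    "/gauntlet.html",
    "/about.html",
    "/progress.html",
]
ROUTES = [to_trailing_slash(r) for r in LEGACY_ROUTES]
```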
- Drop staffml-validate-vault from pre-commit: full per-question checks need
vault build --legacy-json; book and unrelated pre-commit runs no longer
fail on a missing gitignored corpus.json.
- validate-vault.py: sparse mode (taxonomy + manifest) when corpus.json is
absent; full path unchanged when the bundle exists locally or after build.
- staffml-validate-dev smoke job: install vault-cli, run vault build
--legacy-json before validate-vault and corpus invariants (same contract as
preview/publish), raise job timeout for the build step.
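A rough sketch of the sparse/full decision described above, assuming the mode depends only on whether the gitignored bundle exists. `pick_validation_mode` is a hypothetical name, not a function from validate-vault.py.

```python
from pathlib import Path


def pick_validation_mode(corpus_path: Path) -> str:
    """Return 'full' when the generated bundle exists, else 'sparse'.

    Assumption: 'sparse' covers taxonomy + manifest only, as described above.
    """
    return "full" if corpus_path.is_file() else "sparse"
```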
Closes the cleanup arc (A.1–A.10 in RESUME_PLAN_RELEASE.md). Every
gate is now green: vault check --strict, vault lint, vault doctor,
vault codegen --check, staffml validate-vault, Playwright (9/9), tsc.
A.1 mobile-1962.svg: renamed `Edge` → `RegEdge` in graphviz source
(`Edge` is a reserved keyword); SVG renders cleanly. Also fixed
tinyml-1570.py (missing `import numpy as np`) which the new failure
log surfaced.
A.2 render_visuals.py: structured per-ID failure log written to
`_validation_results/render_failures.json` on every run; non-zero
exit on any per-item crash; new `--fail-fast` and `--failure-log`
CLI options. Replaces the prior silent-failure mode.
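The failure-log behavior in A.2 might look roughly like this. The function name, the `render` callback, and the JSON layout are assumptions for illustration; only the log path and the fail-fast / non-zero-exit behavior come from the description above.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def render_all(ids: Iterable[str], render: Callable[[str], None],
               failure_log: Path, fail_fast: bool = False) -> int:
    """Render each item, record per-ID failures, and return an exit code."""
    failures = []
    for item_id in ids:
        try:
            render(item_id)
        except Exception as exc:  # record the crash instead of swallowing it
            failures.append({"id": item_id,
                             "error": f"{type(exc).__name__}: {exc}"})
            if fail_fast:
                break
    # The log is written on every run, even when there are no failures.
    failure_log.parent.mkdir(parents=True, exist_ok=True)
    failure_log.write_text(json.dumps({"failures": failures}, indent=2))
    return 1 if failures else 0
```

A caller would pass the return value to `sys.exit`, so any per-item crash fails the run.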
A.3 LinkML visual schema: typed as a structured sub-schema. New
`VisualKind` enum (svg only — `mermaid` was reserved but never
shipped, dropped to keep the enum honest). Path regex tightened
to `^[a-z0-9-]+\.svg$`. `alt` has a minimum length of 10; `caption` is
required, with a minimum length of 5. TypeScript Visual interface + Question.visual
field added to staffml-vault-types/index.ts.
A.4 Pydantic Visual + Question validators:
- Visual.kind hard-rejects anything but `svg`
- Visual.path enforces the new regex
- Visual.alt min 10 chars, caption required min 5 chars
- Question.model_validator: visual.path MUST resolve to a real
file under interviews/vault/visuals/<track>/. Skipped in
production deploys where the working tree is absent.
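The field rules in A.3/A.4 can be approximated with the standard library alone. This is not the Pydantic model; `visual_errors` is a hypothetical helper, and the on-disk path resolution check is deliberately left out.

```python
import re

# Regex quoted from the schema change above.
VISUAL_PATH_RE = re.compile(r"^[a-z0-9-]+\.svg$")


def visual_errors(kind: str, path: str, alt: str, caption: str) -> list[str]:
    """Return a list of rule violations; an empty list means the visual passes."""
    errors = []
    if kind != "svg":
        errors.append("kind must be 'svg'")
    if not VISUAL_PATH_RE.fullmatch(path):
        errors.append("path must match ^[a-z0-9-]+\\.svg$")
    if len(alt) < 10:
        errors.append("alt must be at least 10 characters")
    if len(caption) < 5:
        errors.append("caption must be at least 5 characters")
    return errors
```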
A.5 Registry repair + doctor split:
- tools: repair_registry.py appended 5,269 missing IDs
(the rename refactor at 8a5c3ff3c left the append-only registry
unsynced; this brings disk-coverage to 100%). Header block in
id-registry.yaml documents the rebuild rationale.
- doctor.py: split symmetric `registry-integrity` check into
`disk-coverage` (HARD FAIL if any disk YAML id is unregistered)
and `registry-history` (INFO ONLY for retired ids — the registry
is by design an audit log, retired ids are normal). Pre-existing
`_check_schema_version` bug (`versions == {1}` vs string `"1.0"`)
fixed.
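The `_check_schema_version` mismatch is easy to reproduce in isolation. The snippet below assumes the YAML files carry the string "1.0"; `schema_versions_ok` is a hypothetical stand-in for the fixed check.

```python
def schema_versions_ok(versions: set) -> bool:
    """Accept only the string schema version "1.0" (assumed on-disk value)."""
    return versions == {"1.0"}


found = {"1.0"}       # versions collected from disk, as strings
buggy = found == {1}  # the old comparison: an int never equals a string
fixed = schema_versions_ok(found)
```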
A.6 Lint calibration via 4-expert consensus + bloom-canonical
reclassification:
- Spawned 4 experts (Vijay Reddi, Chip Huyen, Jeff Dean,
education-reviewer) on 42 disputed (zone, level) pairs;
consensus-builder aggregated to 15 valid / 19 invalid / 8
borderline.
- User arbitrated 8 borderlines: 7 widen / 1 reclassify.
- Built ZONE_BLOOM_AFFINITY matrix (Education-Reviewer's idea):
every zone admits its dominant Bloom verb + adjacent verbs,
rejects clear hierarchy violations.
- reclassify_zone_bloom_mismatch.py applied 576 deterministic
zone fixes via BLOOM_CANONICAL_ZONE mapping (e.g. fluency+analyze
→ analyze, recall+analyze → analyze, evaluation+apply → implement).
- Question.model_validator(_zone_bloom_compatible): hard-rejects
future zone-bloom mismatches at write time. Generated drafts
can no longer ship a self-contradicting classification.
- ZONE_LEVEL_AFFINITY widened per consensus + arbitration +
post-reclassification adjustments. Lint warnings: 1,308 → 0.
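A sketch of the deterministic reclassification in A.6. The affinity sets below are placeholder values, not the real ZONE_BLOOM_AFFINITY matrix; the canonical-zone map contains only the two entries implied by the examples above.

```python
# Placeholder subset: the real matrix is larger and its contents may differ.
ZONE_BLOOM_AFFINITY = {
    "recall": {"remember", "understand"},
    "fluency": {"understand", "apply"},
    "analyze": {"analyze", "evaluate"},
    "implement": {"apply", "analyze"},
    "evaluation": {"evaluate", "analyze"},
}
# Only the entries implied by the examples in the text.
BLOOM_CANONICAL_ZONE = {"analyze": "analyze", "apply": "implement"}


def reclassify(zone: str, bloom: str) -> str:
    """Keep a compatible zone; move a mismatch to the bloom verb's canonical zone."""
    if bloom in ZONE_BLOOM_AFFINITY.get(zone, set()):
        return zone
    # Verbs without a canonical zone are left for manual review.
    return BLOOM_CANONICAL_ZONE.get(bloom, zone)
```

The same affinity lookup can back a write-time validator that rejects a mismatch instead of rewriting it.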
A.7 Chain integrity:
- repair_chains.py: drops chain refs when a chain has <2 published
members (chain ceases to exist), renumbers all members of any
chain whose positions are non-sequential / duplicated /
non-monotonic-by-level. Sort key: level ascending, then old
position, then qid (deterministic).
- validate-vault.py: relaxed sequential check to unique-positions
check. Position gaps from mid-chain deletions are normal; what
matters is uniqueness + bloom-monotonicity (vault check --strict
enforces both from YAML source-of-truth).
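The renumbering rule and the relaxed uniqueness check in A.7 can be sketched as below. Field names (`level`, `position`, `qid`), integer levels, and the 1-based start are assumptions for illustration, not the repair_chains.py data model.

```python
def renumber_chain(members: list[dict], start: int = 1) -> list[dict]:
    """Assign sequential positions ordered by level, old position, then qid."""
    ordered = sorted(
        members, key=lambda m: (m["level"], m["position"], m["qid"])
    )
    return [{**m, "position": start + i} for i, m in enumerate(ordered)]


def positions_unique(members: list[dict]) -> bool:
    """Relaxed check: gaps are allowed, duplicate positions are not."""
    positions = [m["position"] for m in members]
    return len(positions) == len(set(positions))
```

Including qid as the final sort key makes the result independent of input order, which is what keeps the repair deterministic.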
A.8 Practice page visual + zoom modal:
- QuestionVisual.tsx: wraps the `<img>` in `<Zoom>` from
react-medium-image-zoom (4 KB). Click image → fullscreen
`<dialog data-rmiz-modal>`; ESC closes. Added test-id
`question-visual-img` for stable selector.
- New Playwright test: 9th in the suite, deep-links cloud-4492,
asserts the dialog opens on click and closes on ESC.
- TypeScript: removed `mermaid` from local Visual types in
corpus.ts and corpus-vault.ts; tsc clean.
A.9 All gates green:
- vault check --strict: 0 errors / 0 invariant failures
- vault lint: 0 errors / 0 warnings (was 1,308 warnings)
- vault codegen --check: artifacts in sync (hash baseline updated)
- vault doctor: 0 fails (registry-history is info-only; git-state warns
about the uncommitted state that precedes this commit)
- staffml validate-vault: 0 errors / 0 warnings, deployment-ready
- Playwright: 9/9 pass (was 8; +zoom modal test)
- render_visuals: 0 errors (was 2 silent failures pre-A.2)
- tsc: clean
Distribution after reclassification: 9,544 published unchanged;
576 items moved zone via bloom-canonical mapping (full per-item
report at /tmp/reclassify_changes.csv). Chain count 879 → 850
after orphan-singleton drops. release_hash updated.
Carry-forward to next session (Phase B):
- Priority gap closure for parallelism cells + global L4-L6+
(the run that produced this corpus did not close the targeted
cells; B.3 needs specialized prompts per cell-class)
- 120 NEEDS_FIX items from coverage_loop/20260425_150712/ still
carry judge fix_suggestions; spawn fix-agent in Phase C
Two pre-release polish items.
1. Playwright E2E smoke in staffml-validate-dev
New 'e2e-smoke' job loads 6 critical routes (/, /practice, /plans,
/gauntlet, /about, /progress) in headless Chromium, asserts
HTTP 200 + zero uncaught page errors + zero console.errors
(allowlist for known Next.js static-export quirks + CF beacon
noise). Would have caught the hydration shape-mismatch bug
(PR #1440) before merge — the first symptom of that bug was a
console TypeError + white screen, both of which the smoke
detects.
interviews/staffml/scripts/e2e-smoke.py runs the probe: starts
python3 http.server from out/ on port 3000 (required by the
vault-worker CORS allowlist so hydration fetches succeed),
exercises the routes in sequence, reports per-route results,
exits non-zero on any failure.
CI job: fresh checkout -> npm ci -> vault build --legacy-json
(corpus regen) -> npm run build (no base-path for local serve)
-> playwright install chromium -> run script. ~4 min end-to-end.
summary job updated to depend on e2e-smoke and fail the workflow
if it errors.
2. validate-vault required by publish guard
staffml-publish-live.yml previously only gated on the latest
validate-dev run being green. validate-vault covers corpus
invariants (chain integrity, schema drift, taxonomy DAG) that
validate-dev doesn't, so a YAML-only PR that broke chain
integrity could slip through.
Added guard-vault job calling infra-publish-guard.yml with
validate_workflow: staffml-validate-vault.yml. build-and-deploy
now needs: [guard-dev, guard-vault].
Local dry-run of e2e-smoke.py passed all 6 routes with zero errors.
Removes the last active coupling between StaffML questions and the
mlsysbook.ai site:
Deleted files
=============
- interviews/staffml/src/data/chapter-urls.json
27-entry chapter-id → relative-path map. All 27 URLs currently 404
against production because the live site serves /contents/core/...
while the manifest uses /contents/vol1|vol2/... paths.
- interviews/staffml/scripts/check-deep-dive-links.py
Weekly URL-health probe that walked chapter-urls.json. Nothing else
consumes it; its sole SOURCE_PATH was the manifest above.
- .github/workflows/staffml-link-check.yml
Scheduled CI (cron '0 9 * * 1') + PR-comment + auto-issue-filing
pipeline for the probe. With the probe gone, the workflow had no
job left. Grep confirmed no other workflow depends on its
'staffml-link-report' artifact name.
Modified
========
- interviews/staffml/scripts/DEPRECATED.md
Drop the 'check-deep-dive-links.py' row (script no longer exists
so the replacement pointer is no longer meaningful).
- interviews/staffml/.gitignore
Drop the '_deep_dive_link_report.json' ignore (the script that
produced it is gone).
What replaces this
==================
Nothing yet. Per the resources-list model adopted in the preceding
commits, per-question book links are an author-curated editorial
act — authors add { name, url } entries to Details.resources when
book URLs stabilize (mlsysbook.ai/vol1 still moving). Until then,
StaffML is deliberately self-contained for book-linking purposes.
Ecosystem-level cross-linking to the book remains via Nav.tsx's
existing 'MLSysBook.ai' header link (stable, points at homepage);
a more prominent affordance is planned for a follow-up commit.
The probe was reading per-question `deep_dive_url` values from
src/data/corpus.json, but that field was removed during the vault
Phase-1 migration — every question now has `details.deep_dive_url=None`,
so the probe produced an empty 0-URL report.
Repoint at src/data/chapter-urls.json (the 27-entry chapter-id →
relative-path map that src/lib/refs.ts consumes), prefixed with
https://mlsysbook.ai. This is the correct StaffML→textbook link
surface until topic-granular linking ships (deferred; see
interviews/vault/BOOK_LINKING_PLAN.md).
Also add `pull-requests: write` to the workflow permissions. The
PR-comment step was 403'ing with "Resource not accessible by
integration" because GitHub requires `pull-requests: write` to
post comments on PRs (even though the underlying API is
`issues.createComment`).
Updates workflow `paths:` trigger: corpus.json → chapter-urls.json.
Note: the probe now correctly surfaces that chapter-urls.json paths
(`/contents/vol1/...`) drift from the current live site (`/contents/core/...`).
That drift is a real bug but is a book-linking concern, deferred to
the separate session tracked in BOOK_LINKING_PLAN.md.
v2.3 → v2.4. ARCHITECTURE.md header + Appendix reflect the completed
migration.
WHAT CLOSED (§11.1 contract):
1. `vault build --legacy-json` regenerates the site's
interviews/staffml/src/data/corpus.json from YAML. 9,199 published
questions, site-compatible shape (chain_positions back to 0-indexed
dict form, bloom_level derived from zone, competency_area aliased
from topic, scope aliased from track). Deterministic via sort_keys +
id-sort.
2. Pre-commit hook INSTALLED via worktree-aware Makefile target
(`make -C interviews/vault-cli hooks`). Symlink points at
pre_commit_corpus_guard.py. Tested end-to-end: direct edit to
vault/corpus.json triggers exit-1 with §11.1 reference.
3. CI equivalence check added to .github/workflows/vault-ci.yml:
regenerates corpus.json from YAML, diffs against committed. Fails
PR on drift with actionable error message.
4. Legacy generators demoted with DEPRECATED headers:
- interviews/paper/scripts/analyze_corpus.py → vault export-paper
- interviews/staffml/scripts/sync-vault.py → vault build --legacy-json
- interviews/staffml/scripts/generate-manifest.py → vault publish
- interviews/vault/scripts/export_to_staffml.py → vault build --legacy-json
5. New DEPRECATED.md files at interviews/vault/scripts/ and
interviews/staffml/scripts/ map every legacy script to its
replacement. Both directories keep the old scripts for git-history
legibility and archaeology; new contributors see the vault CLI first.
6. ARCHITECTURE.md §Appendix rewritten as current-state table instead
of aspirational "gone. replaced by..." entries.
NEW TESTS (interviews/vault-cli/tests/test_legacy_export.py, +4):
- test_legacy_shape_matches_site_interface: every field corpus.ts
declares is present in regenerated JSON.
- test_chain_positions_legacy_shape: 1-indexed new schema →
0-indexed legacy dict form.
- test_emitter_deterministic: byte-stable across reversed input order
(required for CI diff-check).
- test_competency_area_aliases_topic: legacy alias fields populated
correctly.
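The two properties these tests pin down (index shift and byte-stable output) can be illustrated with a small emitter. `to_legacy_positions` and `emit_legacy` are hypothetical names; the real exporter also derives the alias fields, which this sketch omits.

```python
import json


def to_legacy_positions(chain_positions: dict[str, int]) -> dict[str, int]:
    """Shift 1-indexed positions to the 0-indexed legacy dict form."""
    return {chain_id: pos - 1 for chain_id, pos in chain_positions.items()}


def emit_legacy(questions: list[dict]) -> str:
    """Serialize deterministically: records sorted by id, keys sorted."""
    rows = [
        {**q, "chain_positions": to_legacy_positions(q.get("chain_positions", {}))}
        for q in sorted(questions, key=lambda q: q["id"])
    ]
    return json.dumps(rows, sort_keys=True, ensure_ascii=False)
```

Because both the record order and the key order are fixed, a plain diff against the committed file is enough for the CI drift check.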
FULL MATRIX GREEN:
pytest: 38/38 passed in 0.19s (34 + 4 legacy-export)
ruff: All checks passed
hook: exit 0 on clean diff / exit 1 on corpus.json direct edit
e2e: vault build --legacy-json regenerates a bit-identical corpus.json
vs the committed one; CI check wired to catch drift
WHAT'S LEFT (deploy-gated, §20.5 #1, #5, #6 partial, #8, #9):
- Production serves from D1: requires Phase-3 wrangler d1 create + deploy
- Manual QA per CUTOVER_QA.md: requires live staging
- Zero data loss D1-side verification: requires live D1
- 48h monitoring: requires production traffic
These are intrinsically user-action; the YAML-side migration is done.
The corpus has 4,159 deep_dive_url values across 1,004 unique URLs.
A baseline run today shows roughly half are dead (mlsysbook.ai chapter
routes return 404 even though the homepage links to them; the
harvard-edge.github.io dev mirror is fully retired). This script and
workflow surface the rot in CI so we catch regressions at PR time
instead of via user reports.
Script (interviews/staffml/scripts/check-deep-dive-links.py):
* Walks corpus.json and deduplicates the 4,159 references → 1,004
unique URLs to actually probe
* Probes via curl (more portable than urllib's macOS SSL chain),
HEAD with GET fallback, 6s timeout
* 8-worker thread pool keeps the full run under 2 minutes
* Aggregates into a JSON report at scripts/_deep_dive_link_report.json
(gitignored)
* Supports --hosts allowlist for targeted probing and --fail-on-broken
for CI integration
* Skips known-dead hosts (harvard-edge.github.io) without probing them
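The dedupe, skip, and thread-pool steps can be sketched as follows. The `probe` callback is injected so the example never touches the network; the actual script shells out to curl, which is not reproduced here. `check_links` and the report keys are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable
from urllib.parse import urlsplit

# Host named above as retired; skipped without probing.
DEAD_HOSTS = {"harvard-edge.github.io"}


def check_links(urls: Iterable[str], probe: Callable[[str], bool],
                workers: int = 8) -> dict:
    """Deduplicate URLs, skip known-dead hosts, and probe the rest in parallel."""
    unique = sorted(set(urls))
    skipped = {u for u in unique if urlsplit(u).hostname in DEAD_HOSTS}
    to_probe = [u for u in unique if u not in skipped]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        results = dict(zip(to_probe, pool.map(probe, to_probe)))
    return {
        "unique": len(unique),
        "skipped": len(skipped),
        "probed": len(to_probe),
        "healthy": sum(1 for ok in results.values() if ok),
    }
```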
Workflow (.github/workflows/staffml-link-check.yml):
* Runs weekly (Mondays 09:00 UTC) and on PRs that touch corpus.json,
refs.ts, or the script itself
* Uploads the report as a 90-day artifact named staffml-link-report
* On PRs: posts a markdown table with healthy/broken counts and the
top-5 broken-by-impact URLs
* On scheduled runs: opens (or updates) a single tracking issue under
the staffml,link-health labels when health drops below 60%, with
deduplication so we don't spam new issues each week
* Uses concurrency: cancel-in-progress: false so scheduled runs always
finish even if a manual one is mid-flight
* Honors a manual fail_on_broken input from workflow_dispatch
YAML validates via yaml.safe_load. Script verified against the live
corpus on a small subset (arxiv.org + pytorch.org → 212/216 healthy)
and full corpus (488/1004 healthy → 48.8% baseline).
Wire the periodic-table YAML into staffml so the website has a
canonical view of the design space, with a shared sync script that
keeps the React data file derived from periodic-table/table.yml.
* scripts/sync-periodic-table.mjs — generator that reads
../../periodic-table/table.yml and writes
src/data/periodicTable.ts
* src/data/periodicTable.ts — generated TypeScript module with the
full element list (do not edit by hand; re-run the sync script)
* src/app/framework/page.tsx + PeriodicTable.module.css — new
/framework route that renders the table with role colors and
layer rows
* src/components/Nav.tsx — add "Framework" link with the Atom icon
* src/app/layout.tsx, globals.css, ThemeProvider.tsx — supporting
layout adjustments for the new route
* package.json + lockfile — minor dependency bumps
- Expose competency area filter on mobile via collapsible <details>
(was hidden lg:block, now accessible on all screen sizes)
- Preserve user's typed answer after reveal for side-by-side comparison
- Add aria-labels to search input, clear button, and scoring buttons
- Update sync-vault.py to use export_to_staffml.py pipeline
- Regenerate vault manifest (8,053 published questions)
- Add https:// validation on deep_dive_url hrefs
New vault questions use a richer schema: chain_ids as list, chain_positions
as dict mapping chain_id to position. Both generate-manifest.py and
validate-vault.py now handle both old (scalar) and new (list/dict) formats.
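One way to accept both shapes is to normalize on read. The sketch assumes the old format stored a single scalar under the same two keys; that detail and the name `normalize_chains` are assumptions, not taken from the scripts.

```python
def normalize_chains(question: dict) -> tuple[list, dict]:
    """Return (chain_ids, chain_positions) in the new list/dict shape."""
    ids = question.get("chain_ids")
    positions = question.get("chain_positions")
    if ids is None:
        return [], {}  # question is not part of any chain
    if isinstance(ids, list):
        return ids, dict(positions or {})
    # Old shape (assumed): one scalar chain id with one scalar position.
    return [ids], ({ids: positions} if positions is not None else {})
```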
- validate-vault.py: schema checks, uniqueness, taxonomy consistency,
chain integrity, manifest sync, distribution sanity
- Runs in both dev preview and live publish workflows before deploy
- Errors block deployment, warnings are logged for review
- Disabled old interviews-preview-dev Quarto trigger (no longer exists)
- generate-manifest.py: auto-generates vault-manifest.json from corpus data
with version, content hash, question/chain/concept counts, distributions
- Auto-bumps patch version when content hash changes
- Changelog tracks question deltas between versions
- About page: shows vault version card (version, questions, chains, concepts)
- Footer: shows version number (v0.1.0)
- Workflow: run generate-manifest.py after any corpus update
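The hash-driven patch bump can be sketched like this. SHA-256 over canonical JSON, the manifest keys, and the function names are assumptions for illustration; generate-manifest.py may hash and store things differently.

```python
import hashlib
import json


def content_hash(corpus: list[dict]) -> str:
    """Hash a canonical serialization so key order does not affect the digest."""
    canonical = json.dumps(corpus, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def next_manifest(previous: dict, corpus: list[dict]) -> dict:
    """Bump the patch version only when the content hash changes."""
    digest = content_hash(corpus)
    version = previous["version"]
    if digest != previous.get("content_hash"):
        major, minor, patch = (int(p) for p in version.split("."))
        version = f"{major}.{minor}.{patch + 1}"
    return {
        "version": version,
        "content_hash": digest,
        "question_count": len(corpus),
    }
```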