Remove ten files from the public repo that should never have been
tracked. Verified no code references any of them before deleting.
AI-prompt files (private to author tooling, do not belong in the public
repo):
- interviews/vault-cli/docs/GEMINI_SELF_AUDIT_PROMPT.md
- interviews/vault/_pipeline/runs/gemini-self-audit/prompts/{cloud,
edge,global,mobile,tinyml}_audit_prompt.md (5 per-track prompts;
interviews/vault/.gitignore already excludes /_pipeline/, but these
five were force-added in f6c41d7689 before the rule was set)
Dev-scratch artifacts (clearly leftover dev iteration; three of the four
filenames say 'final' a different way):
- interviews/vault-cli/check_results_absolute_final.json
- interviews/vault-cli/check_results_after_repair.json
- interviews/vault-cli/check_results_final.json
- interviews/vault-cli/check_results_total_final.json
No production code, tests, docs, or CI references any of these paths.
The audit-pipeline scripts that *would* write into _pipeline/ already
respect the existing gitignore rule for that directory tree.
A self-contained prompt that lets gemini CLI walk the corpus and audit it
directly via its own filesystem tools, without the audit_corpus_batched.py
Python wrapper. Useful when the wrapper hits rate-limit / exit-55 walls
or when the operator wants Gemini to checkpoint to disk as it goes.
The prompt uses an append-only JSONL output at
interviews/vault/_pipeline/runs/gemini-self-audit/01_audit.jsonl with
resume semantics (re-running skips qids already in the file). Encodes
the same five gates as audit_corpus_batched.py (format_compliance,
level_fit, coherence, math_correct, title_quality) plus a stable JSON
shape so downstream tooling can consume it identically.
Includes invocation guidance: --yolo + --skip-trust, slice by track to
avoid the multi-hour serial walk, resume across sessions.
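A minimal sketch of the append-only JSONL + resume behaviour described above, in Python. The helper names (load_seen_qids, append_result, audit_with_resume) and the audit_one callback are hypothetical; only the "qid" key and the skip-if-already-present rule come from the text.

```python
import json
from pathlib import Path


def load_seen_qids(path: Path) -> set[str]:
    """Collect qids already recorded in the append-only JSONL file."""
    seen: set[str] = set()
    if not path.exists():
        return seen
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                seen.add(json.loads(line)["qid"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A partially written line is ignored, so that qid is retried.
                continue
    return seen


def append_result(path: Path, row: dict) -> None:
    """Append one audit row; earlier lines are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")


def audit_with_resume(path: Path, qids: list[str], audit_one) -> int:
    """Audit only the qids not yet in the file; return how many were audited."""
    seen = load_seen_qids(path)
    done = 0
    for qid in qids:
        if qid in seen:
            continue
        append_result(path, audit_one(qid))
        seen.add(qid)
        done += 1
    return done
```

Because each row is flushed as its own line, an interrupted session loses at most the row in flight.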
Reflects the 2026-05-04 follow-up slices: math-skip-level (15 applies)
and math-finish queue drain (66 applies). Cumulative now 2,372 of
2,757 (86.0%); 385 known-deferred ahead of Phase 6. Also corrects the
original doc's '70 already-applied no-ops' line — those were unverified
math candidates the verify guard skipped, not no-ops.
Self-contained resume guide for the next session:
- Confirms Phases 0-5 (autonomous) + 8 done
- Documents 478 unresolved corrections (cross-refs PHASE_5_UNRESOLVED)
- Step-by-step for Phase 5 cleanup → Phase 6 schema → Phase 7 verify
→ Phase 9 release
- Concrete CLI commands for each step (vault audit review with
--filter-gate flags, vault codegen, vault publish)
- Reference doc map (which doc covers what)
- Pipeline data layout (where the canonical 01_audit.json lives)
- Full commit log from this session
- Merge command to land yaml-audit on dev when ready
- Paste-ready resume prompt for the next Claude Code session
Total estimated remaining work to ship vault 1.0.0: ~9h, mostly Phase 5
review + Phase 6 schema. Tree is clean; ready to hand off.
After the autonomous Phase 5 mass-apply + math-verify passes,
2,279 of 2,757 corrections (82.7%) were auto-applied. The remaining
478 were deliberately not applied because they fail one of three
safety checks:
75 math 'no' — independent Gemini check disputed the fix
14 math 'unclear' — Gemini wasn't confident
13 math + level-block — fix has level relabel that breaks a chain
168 relabel-up — against CORPUS_HARDENING_PLAN.md §10 Q3
138 chain-block — would break chains.json monotonicity
70 already-applied — no action needed
This doc:
- Summarizes the skip reasons + counts
- Points to the disposition logs in _pipeline/runs/
- Recommends a per-category review workflow
- Notes which categories are highest priority (math 'no')
- Notes which are chain-restructuring decisions (out of Phase 5 scope)
Reviewer flow uses `vault audit review` (apply_corrections.py wrapper)
with --filter-gate to target specific buckets.
Phase 5 autonomous portion is COMPLETE. Phase 6 (schema tightening)
remains safe to attempt once the 478 are dispositioned or
accepted as known-deferred.
merge_audit_runs.py — merges multiple per-track audit_corpus_batched
output dirs into one canonical run. Per-qid prefer non-error rows,
then rows with suggested_corrections.
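The per-qid preference rule can be sketched as a two-part rank. This is an illustration, not the script itself: the "error" field name and the tie-break (first row seen wins) are assumptions; "qid" and "suggested_corrections" are the names used in the text.

```python
def row_rank(row: dict) -> tuple[int, int]:
    """Higher is better: non-error rows first, then rows with suggestions."""
    is_ok = 0 if row.get("error") else 1
    has_fixes = 1 if row.get("suggested_corrections") else 0
    return (is_ok, has_fixes)


def merge_runs(runs: list[list[dict]]) -> dict[str, dict]:
    """Merge several runs into one row per qid, keeping the best-ranked row."""
    merged: dict[str, dict] = {}
    for rows in runs:
        for row in rows:
            qid = row["qid"]
            # Strictly-greater comparison keeps the first row on ties.
            if qid not in merged or row_rank(row) > row_rank(merged[qid]):
                merged[qid] = row
    return merged
```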
AUDIT_FINDINGS_2026-05-03.md — first complete corpus audit.
summarize_audit.py — truncate rationale snippets at word boundaries
(was truncating mid-word, tripping codespell on words like 'claimin').
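One way to truncate at a word boundary, as a sketch. The function name, the "..." suffix, and the behaviour for a single over-long word (return only the suffix) are assumptions, not taken from summarize_audit.py.

```python
def truncate_at_word(text: str, limit: int, ellipsis: str = "...") -> str:
    """Shorten text to at most `limit` characters without cutting a word."""
    if len(text) <= limit:
        return text
    budget = limit - len(ellipsis)
    if budget <= 0:
        return ellipsis[:limit]
    cut = text[:budget]
    # If the cut lands inside a word, back up to the last space.
    if not text[budget].isspace():
        head, sep, _ = cut.rpartition(" ")
        cut = head if sep else ""
    return cut.rstrip() + ellipsis
```

With this rule a snippet ends on 'is...' rather than 'clai...', so spell-checkers never see a partial word.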
Phase 4 final stats (9,446 published questions audited):
format_compliance: ~960 fail
level_fit: ~1,580 fail
coherence: ~480 fail
math_correct: ~330 fail
title_quality: ~250 placeholder + ~25 malformed
20 error rows in global to retry on next run
1,767 questions have suggested_corrections; ~1,500 more need a
propose-fixes backfill pass (mostly cloud, some edge).
CORPUS_HARDENING_PLAN.md Phase 4 finalization.
End-to-end plan for taking the published-corpus audit from "stratified
sample at ~2,900 calls / 12 days" to "full corpus at ~450 calls / ~3
days". The previous plan over-budgeted by 6× because it assumed
1-call-per-gate-per-question; switching to batched 30-questions-per-call
collapses the cost.
Nine phases, 27 testable acceptance criteria. End state: every published
YAML conforms to a strict schema with load-time-enforced format markers
(Pitfall/Rationale/Consequence + Assumptions/Calculations/Conclusion);
math, level-fit, coherence, vendor fabrication, and physical realism are
independently Gemini-verified at corpus scale; new violations are caught
at vault check --strict time and cannot silently land.
Major design choices:
- Audit + corrections in one tool (audit_corpus_batched.py), with a
--propose-fixes mode whose suggestions are NEVER auto-applied —
humans review via apply_corrections.py.
- Schema tightening AFTER cleanup, not before (Phase 6 lifts pattern
constraints into LinkML / Pydantic only once Phase 5 has cleaned the
corpus, so the new constraints reject nothing real).
- Cron the audit (Phase 8) so findings become a routine artifact.
- AUTHORING.md + vault new scaffold (Phase 2) so new contributors see
the format conventions before authoring, not after CI catches them.
Captures the release-readiness state of the vault and the plan for
finishing the audit work the 250/day Gemini cap has constrained.
Corpus health survey (9,446 published questions, no Gemini cost):
- 100% schema-valid (Pydantic)
- 90.9% format-compliant (Pitfall/Rationale/Consequence + Assumptions/
Calculations/Conclusion markers)
- 9.1% fail format compliance (861 questions; mechanical fixes)
- 134 placeholder titles (all global/* "Global New NNNN")
- 407 with provenance: None (should be "imported")
- 95.3% canonical bold-marker napkin_math; 4.7% partial / bullet-only
Template gap noted: vault new scaffolds only scenario + solution stubs;
the Pitfall/Rationale/Consequence and Assumptions/Calculations/Conclusion
templates are encoded ONLY in the generation prompt and the
format-compliance regex. There's no human-readable AUTHORING.md.
The new session is asked to ship one.
The plan: stratified sample of 1,000 questions (33 per track × level
cell) with full Gemini gate suite (math + coherence + level_fit +
bridge) at ~2,900 calls across ~12 days at the 250/day cap. Full-corpus
audit (~27,400 calls / ~110 days) is infeasible; sampling captures any
failure mode at >5-10% rate.
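The budget arithmetic above can be reproduced in a few lines. The constants are the ones quoted in the plan; the p_detect helper is an added illustration that assumes independent draws within a cell, which the plan does not state.

```python
import math

DAILY_CAP = 250        # Gemini calls per day
SAMPLE_SIZE = 1_000    # stratified sample size
SAMPLE_CALLS = 2_900   # estimated calls for the sample
CORPUS_SIZE = 9_446    # published questions

calls_per_question = SAMPLE_CALLS / SAMPLE_SIZE
sample_days = math.ceil(SAMPLE_CALLS / DAILY_CAP)
full_calls = round(CORPUS_SIZE * calls_per_question)
full_days = math.ceil(full_calls / DAILY_CAP)


def p_detect(rate: float, n: int) -> float:
    """Chance that at least one of n sampled questions shows a failure
    mode occurring at `rate` (assumes independent draws)."""
    return 1.0 - (1.0 - rate) ** n


print(sample_days, full_calls, full_days)
print(round(p_detect(0.05, 33), 3), round(p_detect(0.10, 33), 3))
```

Under that independence assumption, a 33-question cell would surface a 5% failure mode roughly four times out of five, and a 10% one nearly always.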
Includes:
- Concrete numbers from the corpus survey (failure counts by category)
- Day-by-day execution plan with resume instructions
- Daily cost-ledger format
- Stopping rules
- Post-audit cleanup → paper.tech update path
- Mechanical (no-Gemini) cleanups the new session can do in parallel
with the daily audit cycle (provenance fix, format markers, AUTHORING.md)
CHAIN_ROADMAP.md Progress Log entry points the resumable cursor at
this plan.
Establishes one ignored subdirectory for ALL intermediate outputs of
LLM-driven tooling (chain proposals, gap detection, draft scorecards,
audit traces). Single gitignore rule: /_pipeline/.
Convention is documented in interviews/vault/README.md under "Pipeline
artifacts" — it's a real project layout convention, not AI-specific
config.
Path migration:
interviews/vault/chains.proposed*.json
→ _pipeline/chains.proposed*.json
interviews/vault/gaps.proposed*.json
→ _pipeline/gaps.proposed*.json
interviews/vault/draft-validation-scorecard.json
→ _pipeline/draft-validation-scorecard.json
interviews/vault/audit-runs/
→ _pipeline/runs/
8 scripts updated to define a PIPELINE_DIR constant and route default
outputs through it: build_chains_with_gemini.py,
apply_proposed_chains.py, merge_chain_passes.py, validate_drafts.py,
audit_chains_with_gemini.py, generate_question_for_gap.py,
summarize_proposed_chains.py, promote_drafts.py.
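A sketch of the PIPELINE_DIR pattern. Only the constant name and the _pipeline/ layout come from this commit; VAULT_DIR, default_output and run_dir are hypothetical helpers for illustration.

```python
from pathlib import Path

VAULT_DIR = Path("interviews/vault")     # assumed repo-relative anchor
PIPELINE_DIR = VAULT_DIR / "_pipeline"   # the single ignored subdirectory


def default_output(name: str) -> Path:
    """Route an intermediate artifact under PIPELINE_DIR."""
    return PIPELINE_DIR / name


def run_dir(timestamp: str) -> Path:
    """Per-run audit traces live under _pipeline/runs/<ts>/."""
    return PIPELINE_DIR / "runs" / timestamp
```

Routing every default through one constant means a future relocation is a one-line change per script.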
Forward-looking docs (README.md chain-pipeline section + CHAIN_ROADMAP.md
resume instructions + state snapshot) updated to reference the new
paths. Historical Progress Log entries left as-is — they accurately
describe what was committed at the time.
Drive-by .gitignore fixes (both used full repo-relative paths under
package-local .gitignore files, which never matched):
interviews/vault-cli/.gitignore: scripts/.calibration_cache/
interviews/vault/.gitignore: /embeddings.npz
Validation:
- vault check --strict: 10,705 loaded, 0 invariant failures
- pytest interviews/vault-cli/tests/: 74/74
- audit --dry-run: paths resolve correctly to _pipeline/runs/<ts>/
No durable corpus content moves. chains.json (live registry),
id-registry.yaml, questions/, etc. all stay where they were.
Action on the strongest finding from the 2026-05-01 independent audit:
54 of 55 Δ=0 chains had no shared scenario (the "two questions
sharing a scenario thread" constraint the lenient prompt was supposed
to enforce). Two independent audit fields agreed (verdict=bad and
shared_scenario=no), so this isn't a tuning question — the design
choice was wrong.
Why remove Δ=0 entirely rather than tighten the prompt:
- The chain definition is "pedagogical progression through Bloom
levels"; same-level edges contradict the definition.
- The "shared scenario / different angle" carve-out is unenforceable
by an LLM at corpus scale (audit confirmed).
- Same-scenario same-level pairs are more honestly modeled as
siblings of a chain anchor, not as chain members.
Changes:
- chains.json: 879 → 824. Dropped: 55 chains (all tier=secondary,
since Δ=0 was only ever produced by the lenient sweep).
Per-track: edge -19, tinyml -12, mobile -10, cloud -7, global -7.
- build_chains_with_gemini.py:
MODE_CONFIG["lenient"]["allowed_deltas"]: {0,1,2,3} → {1,2,3}
LENIENT_PROMPT_TEMPLATE: Δ=0 paragraph rewritten to explicitly
REJECT same-level pairs (with rationale citing the audit).
docstring + --mode help text updated.
- tests/test_chain_validation.py:
test_lenient_accepts_same_level_pair → test_lenient_rejects_same_level_pair
header docstring updated to reflect the new rule.
- vault-manifest.json: chainCount 879 → 824, releaseHash rolls to
479811040b7a… (real content delta, not a timestamp churn).
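The effect of the allowed_deltas change, as a sketch. The lenient set {1,2,3} is from this commit; the edge_allowed helper and the integer level encoding (L3 as 3) are assumptions.

```python
MODE_CONFIG = {
    # Only the lenient entry is taken from the commit text.
    "lenient": {"allowed_deltas": {1, 2, 3}},
}


def edge_allowed(mode: str, from_level: int, to_level: int) -> bool:
    """Return True when the level step between two chain members is permitted."""
    delta = to_level - from_level
    return delta in MODE_CONFIG[mode]["allowed_deltas"]
```

With 0 removed from the set, a same-level pair is rejected by construction rather than by prompt wording.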
Validation:
- vault check --strict: 10,705 loaded, 0 failures
- vault build --local-json: chainCount=824, releaseHash=479811040b…
- pytest: 74/74
- playwright chain-and-vault-smoke: 19/19 (fixtures cloud-0001 +
cloud-0231 are still in their chains post-drop)
Audit findings #2 (gap detection ~50% noise) and #3 (4 pilot drafts
disposition) remain open — see CHAIN_ROADMAP.md Progress Log.
Ran audit_chains_with_gemini.py end-to-end. 18 Gemini-3.1-pro-preview
calls (well under the 250/day cap) sized to 80-336K char prompts (the
attention sweet spot at ~80-100K input tokens). Per-call traces under
interviews/vault/audit-runs/20260501T213817Z/, rollup at
interviews/vault/audit-runs/AUDIT_REPORT.md.
Three critical findings the pipeline's own gates missed:
1. Δ=0 chains are ~98% bad (54/55 judged "bad", 54/55 judged
"shared_scenario_for_d0_pair: no"). The lenient prompt's
constraint that Δ=0 only fire for shared-scenario pairs didn't
bind in practice. 6% of chains.json is affected.
2. Gap detection is ~50% noise. 21 of 40 sampled gaps judged
"hallucinated" — anchors don't share a scenario thread. Phase 3
generation should pre-filter gaps before issuing the call.
3. Pilot draft pass rate was inflated by validate_drafts.py's LLM
judges:
mobile-2147 accept
edge-2536 edit (scenario truncation)
edge-2537 REJECT (cognitive load too low for L3)
mobile-2146 REJECT (physically absurd 0.5s/4W NPU wake-up)
Calibration findings:
- Primary chains (n=100): 64% good, 22% weak, 14% bad
- Secondary chains (n=100): 61% good, 33% weak, 6% bad
- Tier delta vs primary is small at "good" — the actual quality
cliff in secondary is concentrated in the Δ=0 subset.
No autonomous fixes filed — per agreement, audit produces findings
only. CHAIN_ROADMAP.md Progress Log spells out the three concrete
decisions for next session (drop / demote / rebuild Δ=0; pre-filter
gaps; disposition the 4 drafts per AUDIT_REPORT.md).
Total Gemini calls this session: 55 (Phase 1.4 + Phase 3 pilot + audit).
D-cleanups folded into one commit:
- CHAIN_ROADMAP.md status header reflects current state (Phase 1+2
complete, Phase 3 pilot landed, Phase 4 mostly shipped).
- Phase 4.1 / 4.6 / 4.7 / 4.9 entries marked complete with commit
refs.
- ARCHITECTURE.md gains a §3.6.1 documenting the two YAML-body
conventions introduced when LLM-authored questions started
landing in Phase 3:
- _authoring private metadata block on drafts (stripped at
promotion)
- gap-bridge:<from>-<to> tag added at promotion for traceability
Neither is schema-enforced (Pydantic accepts extra); both are
stable across the pipeline.
No code changes.
Walkthrough for reviewing LLM-authored question drafts produced by
generate_question_for_gap.py + validate_drafts.py. Covers:
- what each of the 5 gates catches and (critically) misses
- what to read in what order, with watchpoints for the failure modes
that LLM gates routinely let through (vendor-name fabrication,
arithmetic drift, level-stamping mismatches)
- decision tree: promote (publish vs draft), edit + retry, reject
- exact promote_drafts.py invocations for each path
- rough scorecard summary for the 4 pilot drafts shipped in
a750ab7bc, ready for the user's review pass
Designed for ~10-15 min of reading per pilot batch.
dev renamed the vault-cli flag in 2b381bb949: the flag is the staffml
frontend's local-dev fallback for reading corpus.json from disk, not a
deprecated path, so "local-json" reads correctly in scripts and docs.
The merge of origin/dev (5c5af75ed) brought the new name in, but the
roadmap + README still referenced the old one.
- README.md: 1 replacement in the chain-pipeline runbook footer
- CHAIN_ROADMAP.md: 8 replacements across resume instructions,
phase runbooks, and progress-log validator lines
Historical text inside log entries is otherwise unchanged — those
record what was true at commit time. Forward-looking instructions
now use the current flag name.
Pilot run of the Phase 3 authoring tooling on a 5-gap subset (sized
down from the roadmap's 30 to keep wall-time + Gemini-call budget
reasonable for an unsupervised run).
Pilot scope:
Selected 5 high-value gaps from gaps.proposed.lenient.json — buckets
with ≥4 published questions, biased toward low-density tracks. All 5
picks landed in edge/mobile.
Phase 3.c — generate (5/5 written):
edge-2535 edge/latency-decomposition L?→L3
edge-2536 edge/pruning-sparsity L?→L4
edge-2537 edge/tco-cost-modeling L?→L3
mobile-2146 mobile/duty-cycling L?→L3
mobile-2147 mobile/model-format-conversion L?→L2
Phase 3.b validation — 4/5 pass (80% — above roadmap's 60-75% target):
edge-2535: FAIL on originality (cos=0.933 vs edge-1883, threshold 0.92)
edge-2536: pass on all 4 gates
edge-2537: pass on all 4 gates
mobile-2146: pass on all 4 gates
mobile-2147: pass on all 4 gates
The originality gate correctly caught a draft that was too similar
to one of its bridge anchors — exactly the failure mode it was
designed for. Gates were run on schema (Pydantic), originality
(BAAI/bge-small-en-v1.5 cosine vs in-bucket neighbours, threshold
0.92), level_fit (Gemini-judge against same-level exemplars),
coherence (Gemini-judge), and bridge (Gemini-judge against the gap
anchors).
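A pure-Python sketch of the originality gate's comparison. The 0.92 threshold is from the text; the function names, the result shape, and treating a score exactly at the threshold as a fail are assumptions. The real gate uses BAAI/bge-small-en-v1.5 embeddings, which are stood in for here by plain float lists.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def originality_gate(draft_vec, neighbours: dict, threshold: float = 0.92) -> dict:
    """Fail when the nearest in-bucket neighbour reaches the threshold."""
    best_id, best = None, -1.0
    for qid, vec in neighbours.items():
        score = cosine(draft_vec, vec)
        if score > best:
            best_id, best = qid, score
    return {"pass": best < threshold, "nearest": best_id, "cosine": best}
```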
Phase 3.d — promotion (4 passing drafts):
- .yaml.draft → .yaml rename
- _authoring stripped; replaced with proper schema fields:
provenance: llm-draft
status: draft (NOT published — gating on human review)
authors: [gemini-3.1-pro-preview]
human_reviewed: { status: not-reviewed }
tags: + gap-bridge:<from>-<to>
- id-registry.yaml appended (append-only ledger preserved)
- edge-2535.yaml.draft kept in place for the human reviewer's
disposition (rewrite + retry vs delete)
Validation post-promotion:
- vault check --strict: 10,705 loaded (was 10,701; +4 ✓), 0 failures
- vault build --legacy-json: released set unchanged
(status=draft excluded by release-policy.yaml's published filter)
— releaseHash and chainCount intentionally stable until human
review flips status
Phase 3.e (chain rebuild) deferred: drafts must clear human review
and flip to status: published before they're eligible for chain
membership. Runbook in CHAIN_ROADMAP.md Progress Log.
Cost: 5 generation + 15 judge = 20 Gemini calls.
Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.
generate_question_for_gap.py (3.a):
- Reads a gap entry, loads between-questions + same-bucket exemplars,
prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
drafts out of vault check / vault build until promotion.
- ID allocator scans corpus + existing drafts so a batch run gets
distinct fresh IDs without touching id-registry.yaml.
- Modes: --gap-index, --gaps-from + --limit, --dry-run.
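A sketch of how an allocator could scan both published files and drafts for used IDs. The max+1 strategy, the four-digit zero padding, and the helper names are assumptions; the source only states that the scan covers corpus + existing drafts and leaves id-registry.yaml untouched.

```python
import re
from pathlib import Path

ID_RE = re.compile(r"^(?P<track>[a-z]+)-(?P<num>\d+)$")


def used_numbers(root: Path, track: str) -> set[int]:
    """Scan .yaml and .yaml.draft files for IDs belonging to one track."""
    nums: set[int] = set()
    for pattern in ("*.yaml", "*.yaml.draft"):
        for path in root.rglob(pattern):
            stem = path.name.split(".", 1)[0]
            m = ID_RE.match(stem)
            if m and m.group("track") == track:
                nums.add(int(m.group("num")))
    return nums


def allocate_ids(root: Path, track: str, count: int) -> list[str]:
    """Return `count` fresh IDs above the highest number seen on disk."""
    start = max(used_numbers(root, track), default=0) + 1
    return [f"{track}-{n:04d}" for n in range(start, start + count)]
```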
validate_drafts.py (3.b):
- Five gates per draft: schema (Pydantic), originality (cosine vs
in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus
embeddings.npz so values are comparable; cutoff 0.92), level_fit
(Gemini-judge against same-level exemplars), coherence
(Gemini-judge: scenario/question/solution consistency), and bridge
(Gemini-judge: chain-fit between the gap's two anchors).
- Final verdict pass iff every non-skipped gate passes.
- Skips: --no-originality, --no-llm-judge.
- Output: interviews/vault/draft-validation-scorecard.json.
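The verdict rule ("pass iff every non-skipped gate passes") reduces to one expression. The 'pass' / 'fail' / 'skipped' status strings are assumed for illustration.

```python
def final_verdict(gates: dict[str, str]) -> str:
    """gates maps gate name to 'pass', 'fail', or 'skipped'."""
    ran = [status for status in gates.values() if status != "skipped"]
    # Read literally, a draft with every gate skipped would also pass.
    return "pass" if all(status == "pass" for status in ran) else "fail"
```

For example, a run with --no-llm-judge leaves only schema and originality to decide the verdict.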
Smoke checks:
- 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
cloud-4579. Synthetic Gemini response Pydantic-validates clean.
- 3.b on a synthetic /tmp draft: schema + originality pass (top
neighbour cosine 0.73 vs 0.92 threshold).
Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.
CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.
- 4.2: audited multi-chain memberships; 0 qids in >1 chain because
the lenient sweep was scoped to uncovered buckets (no overlap with
primary). Deferred the focused playwright test until Phase 3
authoring makes the case live.
- 4.8: marked complete; cross-ref to f086b6f42.
- Header timestamp + status snapshot updated.
Carries the primary/secondary chain tier (from Phase 1) through the
build pipeline into the practice + explore surfaces, so primary chains
are the unmarked default and secondary chains are an opt-in alternative
path the user can deep-link into via ?chain=<id>.
Backend (2.1):
- legacy_export.py emits chain_tiers per question alongside chain_ids
and chain_positions; missing chain-tier defaults to "primary".
- vault build re-run: 2953 chained questions, all carry chain_tiers
(releaseHash unchanged — new field is additive, doesn't perturb the
manifest hash inputs).
- Existing legacy_export tests were stale (asserted on the v1.0 YAML
chains: field path; v1.1 made chains.json the sidecar source).
Rewrote them to write chains.json fixtures into tmp_path and added
chain_tiers assertions, plus a focused
test_chain_tiers_emitted_per_membership case.
TypeScript (2.2):
- Question.chain_tiers? (Record<string, "primary"|"secondary">)
- ChainTier export, ChainInfo.tier required.
- getChainForQuestion / getAllChainsForQuestion populate tier;
getAllChains... sorts primary first.
- New getPrimaryChainForQuestion(qid) helper for default surfaces.
UI (2.3):
- practice page reads ?chain=<id> URL param; defaults to
getPrimaryChainForQuestion when unset.
- ChainBadge gains an inline "alt path" pill when tier=secondary
(always visible — no click needed).
- ChainStrip mirrors that pill in the progress row for users who
expand the strip.
- Explore page prefers the first non-secondary chain when picking
activeChainId for the related-questions panel.
- Deferred to a follow-up commit (intentional, scoped via Progress Log):
explore-page "Primary only / All" filter; daily/mock routing.
Tests (2.4):
- test7_tier_aware_chain_routing in chain-and-vault-smoke.mjs:
secondary reachable via ?chain=, alt-path badge visible on
secondary, primary regression, alt-path badge ABSENT on primary.
- Full smoke suite: 17/17 pass (was 13/13).
Validation:
- vault check --strict: 10,701 loaded, 0 failures
- vault build --legacy-json: 9438 published, chainCount=879
- pytest interviews/vault-cli/tests: 74/74
- npx tsc --noEmit: 0 errors
- playwright chain-and-vault-smoke: 17/17
Phase 2 complete. Next: Phase 3 (gap-driven authoring; 407-gap backlog).
Loads the published corpus (via vault_cli.policy — single source of truth)
and chains.json, buckets by (track, topic), and emits chain-coverage.json
with two cuts:
- uncovered_buckets: ≥3 questions, 0 chains
- under_covered_buckets: ≥6 questions, ≤1 chain
Plus per-track summary + top-10 uncovered for quick read.
Output is gitignored — regeneratable, fed to Phase 1.4's --buckets-from.
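The two cuts can be sketched with a pair of counters. The thresholds are the ones listed above; representing each question and each chain as a (track, topic) tuple is an assumption made for the example.

```python
from collections import Counter


def coverage_cuts(questions, chains) -> dict:
    """questions, chains: iterables of (track, topic) bucket keys."""
    q_count = Counter(questions)
    c_count = Counter(chains)
    uncovered = sorted(
        b for b, n in q_count.items() if n >= 3 and c_count[b] == 0
    )
    under = sorted(
        b for b, n in q_count.items() if n >= 6 and c_count[b] <= 1
    )
    return {"uncovered_buckets": uncovered, "under_covered_buckets": under}
```

Read literally, a bucket with six or more questions and no chains appears in both cuts; whether the script de-duplicates them is not stated here.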
Phase 1.1 of CHAIN_ROADMAP.md. See progress log for the run results
(211 uncovered buckets, edge/mobile/tinyml chain density 0.6-0.8 vs
cloud's 2.95, biggest miss is cloud:roofline-analysis at 144q/0 chains).
Canonical document for the multi-phase chain growth plan. Future Claude
sessions read this first to resume exactly where the previous session
left off.
Structure:
- Resume Here: how to verify state + pick up the next step
- Current state snapshot: validators, counts, branch tip
- Phase 1: second-pass coverage build (373 -> ~700 chains)
- Phase 2: tier surfacing in schema + UI
- Phase 3: gap-driven authoring (using gaps.proposed.json)
- Phase 4: misc parallel items (CI gates, multi-chain UI, etc.)
- Recommended execution order over 4 weeks
- Progress Log: append-only notes after each step
Initial Progress Log entry captures session state through commit
1ac7d4c56 (Gemini chain rebuild applied, 373 chains live).
The bundled corpus.json was serving as a prod safety net behind the
Cloudflare Worker. Post-cutover the Worker has been the real data
source, and the static path was silently degrading rather than helping
(corpus.json is a generated artifact whose prose `details` are blank
in corpus-summary.json). This change:
- Stops emitting corpus.json in the publish-live workflow
- Removes the Worker-error fallback in getQuestionFullDetail — errors
now propagate to useFullQuestion and the UI shows a "details
unavailable" banner instead of silently filling blanks
- Drops the localhost auto-trigger in shouldUseStaticDetails — the
static path now requires explicit NEXT_PUBLIC_VAULT_FALLBACK=static
- Switches taxonomy.ts to corpus-summary.json (was corpus.json)
- Rewrites the publish-live smoke tests against corpus-summary.json
- Collapses validate-vault.py to sparse-only (per-question deep
validation lives in `vault check --strict`)
Static-fallback remains as an OPT-IN local-dev affordance: set
NEXT_PUBLIC_VAULT_FALLBACK=static and run `vault build --legacy-json`
to materialize corpus.json. The Function-constructor dynamic import
keeps Turbopack from requiring corpus.json at build time.
useFullQuestion hook signature changed from `Question | undefined` to
`{ question, status }`. Callers updated: practice and plans pages
(both render an amber "details unavailable" banner when status
is 'error').
Deleted dead cutover scaffolding: corpus-source.ts (router with no UI
consumers), corpus-vault.ts (worker-only mirror, never wired up),
useVaultQuestion.ts (unused migration hook), vault-fallback.ts (only
consumer was corpus-source.ts).
Deleted stale docs: staffml/scripts/DEPRECATED.md, vault-cli/docs/
CUTOVER_QA.md, three vault/docs/RESUME_PLAN_*.md.
Verified locally: tsc clean, vitest 37/37, next build produces all
15 static routes.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R11 (David, fresh-eyes stability check): 0 Critical + 0 High + 1 Medium
(doc cleanup from R10-F-2 closure itself).
R11-M-1 (MEDIUM): CUTOVER_QA.md + vault-cli/README.md still referenced
--canary-percent flag after R10-F-2 removed it from code + ARCHITECTURE.md.
Operator following CUTOVER_QA.md step 1 of cutover day would hit
'Error: no such option --canary-percent' in the one document whose
entire purpose is cutover correctness.
Fix: CUTOVER_QA.md §1 replaces canary-staged rollout with all-or-nothing
ship language + Phase-7-deferred note pointing at §4.3. README.md:57
drops [--canary-percent N] from the ship example.
STABILITY DECLARED after R11. Three consecutive rounds (R7, R8, R11) with
zero new Criticals. R11 explicit: 'convergence confirmed.'
Finding-density trajectory across 11 rounds (new Criticals per round):
R1: 3, R2: 1, R3: 2, R4: 3, R5: 3, R6: skipped,
R7: 0, R8: 0, R9: 1* (regression-detect, not new), R10: 0, R11: 0
Total findings closed across all rounds: ~120.
No further rounds scheduled.
ARCHITECTURE.md header bumped v2.5 → v2.6.
REVIEWS.md adds 'Rounds 7–11' section with per-round finding counts,
notable findings, meta-observation on R9 (tooling/persistence issue
Gemini caught that individual-file reviewers couldn't), and the
convergence signal.
Worker hardening (interviews/staffml-vault-worker/src/index.ts rewritten):
- B.1 Cloudflare Cache API wired via caches.default; cache key is
/__vault__/<release_id>/<path> so each release is a disjoint namespace.
Deploy changes release_id → all old entries miss atomically. Degraded
responses are NEVER cached (would poison the namespace).
- B.3 Keyset pagination: cursor is {after_id, filter_hash}. Server
computes filter_hash per-request and rejects cross-filter cursor reuse
with 400. Pagination cost drops from O(offset + N) to O(N) per page.
- B.4 Rate limiting via RATE_LIMIT_KV (src/rate_limit.ts): token bucket
per (IP, class) windowed at 60s. 'default' 60 rpm, 'search' 10 rpm.
Returns 429 with Retry-After header. Open-allows if KV not bound so
the local vault-api shim still works.
- /search uses FTS5 MATCH when questions_fts exists; fallback to LIKE
for pre-FTS5 D1 instances. Escapes FTS5 special chars to prevent
MATCH injection.
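The keyset cursor from B.3 can be illustrated in memory. This is a Python sketch of the idea, not the worker's TypeScript: the hash construction, the CursorMismatch name, and the list-of-dicts data model are assumptions; the cursor shape {after_id, filter_hash} and the 400 on cross-filter reuse are from the text.

```python
import hashlib
import json


def filter_hash(filters: dict) -> str:
    """Stable digest of the active filters (assumed construction)."""
    canon = json.dumps(filters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()[:16]


class CursorMismatch(ValueError):
    """Would map to an HTTP 400 in the worker."""


def page(rows: list[dict], filters: dict, limit: int, cursor=None):
    """rows must be sorted by 'id'. Returns (items, next_cursor)."""
    fh = filter_hash(filters)
    after_id = ""
    if cursor is not None:
        if cursor["filter_hash"] != fh:
            raise CursorMismatch("cursor was issued for a different filter")
        after_id = cursor["after_id"]
    # Seek past after_id instead of counting an offset.
    matching = [
        r for r in rows
        if r["id"] > after_id and all(r.get(k) == v for k, v in filters.items())
    ]
    items = matching[:limit]
    more = len(matching) > limit
    nxt = {"after_id": items[-1]["id"], "filter_hash": fh} if more else None
    return items, nxt
```

In SQL the same seek is a `WHERE id > ? ORDER BY id LIMIT ?`, which is what removes the offset term from the per-page cost.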
vault-api.ts circuit breaker (B.2, Soumith R3-F-2 fix):
- Proper closed → open → half-open state machine. Half-open admits
exactly one probe; failure → re-open immediately, success → close.
- AbortSignal.timeout(10_000) per-attempt; AbortSignal.any() combines
with caller's signal so React unmounts don't count as failures.
- Retry only on retryable statuses (408/425/429/5xx/network), not on
4xx user errors or caller-aborted fetches.
- Module-level _singleton so multiple makeClientFromEnv() share breaker
state. __resetSingleton() exposed for tests.
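The state machine can be sketched independently of fetch. This Python version is illustrative only: the failure threshold, the cooldown, and all method names are assumptions; the transitions (closed → open → half-open, one probe, failure re-opens, success closes) follow the description above.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_s = cooldown_s                # assumed value
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self) -> bool:
        """Decide whether a request may be attempted right now."""
        if self.state == "closed":
            return True
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False
            self.state = "half-open"
            self.probe_in_flight = False
        # Half-open: admit exactly one probe.
        if self.probe_in_flight:
            return False
        self.probe_in_flight = True
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0
        self.probe_in_flight = False

    def record_failure(self) -> None:
        if self.state == "half-open":
            self._open()  # a failed probe re-opens immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.probe_in_flight = False
```

Caller-aborted requests would simply not call record_failure, which matches the rule that React unmounts don't count against the breaker.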
Worker vitest suite (B.6, staffml-vault-worker/tests/worker.test.ts):
6 tests: rate-limit under/over cap with Retry-After; schema-fingerprint
placeholder forces degraded mode; real fingerprint clears flag;
cursor filter_hash mismatch returns 400; CORS echoes allowed origin;
405 on POST/PUT/DELETE; /admin/release returns 404 (no auth footgun).
vault ship real hooks (B.15, commands/release.py):
- d1_forward: pnpm exec wrangler d1 execute <env-db> --file <migration.sql>
- d1_rollback: applies d1-rollback.sql (SQL path); snapshot path remains
primary per §6.2.
- nextjs_forward: pnpm run deploy:<env> from site_dir.
- nextjs_rollback: pnpm exec wrangler pages deployment list (lets operator
pick rollback target).
- paper_forward: git tag -a v<version> && git push origin v<version>.
- --skip-legs allows shipping subset (e.g., skip=paper for pre-tag validation).
Content-hash SLI workflow (B.5, .github/workflows/vault-content-hash-sli.yml):
Hourly GitHub Action samples 20 IDs from latest release's vault.db,
fetches same IDs from production worker, recomputes canonical content_hash
in Python, asserts parity. Files a priority-high issue on mismatch.
Avoids porting hashing.py canonicalization to TypeScript (Chip R3-H5's
invariant-bomb risk).
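A sketch of the sample-and-compare loop. The canonical form used here (sorted keys, compact separators, UTF-8) is a stand-in, since the real canonicalization lives in hashing.py and is not reproduced in this message; parity_mismatches and its arguments are hypothetical names.

```python
import hashlib
import json
import random


def content_hash(record: dict) -> str:
    """ASSUMED canonical form; the real rules live in hashing.py."""
    canon = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def parity_mismatches(expected: dict, fetch, k: int = 20, rng=None) -> list:
    """expected: id -> hash from the release; fetch(id) -> production record."""
    rng = rng or random.Random()
    ids = rng.sample(sorted(expected), min(k, len(expected)))
    return [qid for qid in ids if content_hash(fetch(qid)) != expected[qid]]
```

A non-empty return value is the condition under which the workflow would file its issue.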
JSON schemas (B.7, vault-cli/docs/JSON_OUTPUT.md):
Full stable shapes for build, publish, ship, new, rm, move, renumber,
restore, promote, mark-exemplar, snapshot, migrations-emit, export-paper,
tag, deploy, rollback, generate. Plus notes for serve/api (not
JSON-emitting; long-running servers).
Codegen hash baseline (B.13 hash-check variant):
vault codegen --check now computes SHA-256 over 3 shared artifacts and
compares to committed interviews/vault-cli/codegen-hashes.txt. First run
auto-records baseline; subsequent runs enforce no drift. Full LinkML-driven
regeneration remains a Phase-2 follow-up. Baseline recorded this commit.
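The record-then-enforce behaviour can be sketched as below. The per-file digests and the "<digest>  <name>" line format are assumptions; the commit only says SHA-256 is computed over three shared artifacts and compared to the committed codegen-hashes.txt.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_codegen(artifacts: list[Path], baseline: Path) -> list[str]:
    """Return names of drifted artifacts; record the baseline on first run."""
    current = {p.name: sha256_file(p) for p in artifacts}
    if not baseline.exists():
        lines = [f"{digest}  {name}" for name, digest in sorted(current.items())]
        baseline.write_text("\n".join(lines) + "\n", encoding="utf-8")
        return []
    recorded: dict[str, str] = {}
    for line in baseline.read_text(encoding="utf-8").splitlines():
        if line.strip():
            digest, name = line.split(None, 1)
            recorded[name] = digest
    return sorted(n for n, d in current.items() if recorded.get(n) != d)
```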
Component migration hook (B.17,
staffml/src/lib/hooks/useVaultQuestion.ts):
Minimal React hook that routes through corpus-source.ts. Components opt
into the cutover by importing from here; existing corpus.ts callers remain
untouched. Cutover-day swap is one import per component, not a big-bang
replacement.
28/28 pytest still green. release_hash 1b304282... unchanged (no
content-affecting mutations).
EVOLUTION.md (fixes H-1 from REVIEWS.md)
Schema-version rules: SemVer semantics (additive-minor implicit,
breaking-major bumps schema_version). Loader contract across
versions. vault migrate-schema mechanics: parallel tree, forward/
rollback functions, --dry-run, failure log. Mixed-version PRs
forbidden — CI rejects. Canonicalization-version (CANON_VERSION)
bumps separate from schema_version. Historical record stub.
EXIT_CODES.md
Stable exit-code taxonomy table with rationale for each category
(0 vs 1, 1 vs 2, 3 vs 4, 5 as user-abort). Usage in code, tests,
JSON output. Evolution policy: add new codes, never renumber.
JSON_OUTPUT.md
Common envelope: {ok, exit_code, exit_symbol, command,
cli_version, data, errors, warnings}. Per-command schemas for
check, stats, verify, doctor, diff. LSP-diagnostic shape for
check errors. --json-schema meta-command prints per-command
JSON Schema.
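The common envelope maps naturally onto a small dataclass. The eight field names are the ones listed above; the field types, the defaults, and the example values in the usage lines are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Envelope:
    ok: bool
    exit_code: int
    exit_symbol: str
    command: str
    cli_version: str
    data: dict[str, Any] = field(default_factory=dict)
    errors: list[dict[str, Any]] = field(default_factory=list)
    warnings: list[dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so output is stable across runs."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical values, for illustration only.
example = Envelope(ok=True, exit_code=0, exit_symbol="OK",
                   command="check", cli_version="0.0.0")
print(example.to_json())
```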
CONTRIBUTING.md (fixes H-17)
Quick-start path from clone → local site serving a question in
≤10min target. What can be contributed, workflow, PR review.
Provenance-honesty rules. Author attribution via
vault/contributors.yaml. Phase-by-phase scope of what works today
vs what lands later.
All four are referenced directly from ARCHITECTURE.md sections.