StaffML Vault — Testing Plan
Scope: Full test strategy for vault-cli, the Worker API, and the cutover flow. Expands ARCHITECTURE.md §19 with concrete test inventory, fixtures, CI workflow spec, and phase gates.
Status: v1, drafted 2026-04-15 alongside ARCHITECTURE.md v2.1.
1. Test pyramid overview
| Layer | Runtime | Runs on | Blocks | Budget |
|---|---|---|---|---|
| Unit | pytest | every commit | PR merge | ≤5s |
| Integration | pytest | every commit | PR merge | ≤30s |
| Contract (CLI end-to-end) | pytest via Typer CliRunner | every commit | phase transition | ≤20s |
| Data-migration (corpus.json → YAML → vault.db) | pytest | once at Phase 1 exit | Phase 2 start | n/a |
| Equivalence (Merkle hash, site/paper parity) | CI job | every PR Phase 1–3 | PR merge | ≤120s |
| Shared-types drift | CI job | every PR post-Phase-1 | PR merge | ≤30s |
| Worker contract | pytest + miniflare | every commit Phase 3+ | phase transition | ≤60s |
| End-to-end (real site + staging D1) | Playwright | pre-cutover Phase 4 | production cutover | ≤10min |
| Smoke (50 IDs: worker vs direct DB) | vault smoke-test | pre-every-prod-deploy | production deploy | ≤60s |
| Load (worker under realistic traffic) | k6 | pre-cutover Phase 4 | production cutover | ≤30min |
| Rollback (symmetry, drill) | pytest + manual | every publish + quarterly | every deploy | ≤15min |
| Data-plane SLI | Worker cron | continuously post-Phase-3 | n/a (alerts) | ≤5s/run |
CI wall-clock budget on a standard GitHub runner: ≤2 min for all "every PR" checks combined.
2. Test fixtures
2.1 Frozen 20-question test corpus (vault-cli/tests/fixtures/test-corpus/)
Structure mirrors production vault/questions/ exactly. Never changes unless the fixture bump is explicit and documented.
Coverage requirements:
- 5 tracks × 6 levels × subset of zones = 20 questions hitting every major (track, level, zone) combination.
- 3 chained sequences: one 2-deep, one 3-deep, one 4-deep — exercises chain position invariants.
- 1 deprecated question: exercises status-transition invariants.
- 1 LLM-draft + 1 llm-then-human-edited: exercises provenance invariants.
- 1 question in excluded applicability cell: deliberately included to assert the validator rejects it (negative fixture).
Fixture includes:
- questions/*.yaml (20 files).
- taxonomy.yaml, chains.yaml, zones.yaml (frozen subsets referenced by the fixture).
- release-policy.yaml (frozen).
- id-registry.yaml (frozen append-only log).
- release-1.0.0/ (expected snapshot artifacts for comparison).
2.2 Golden vault.db (vault-cli/tests/fixtures/golden/vault.db)
Expected SQLite output from running vault build against the frozen test corpus. Regenerated by vault-cli/tests/regenerate_golden.py when the canonicalization version bumps (rare, explicit, reviewed).
Primary integration test: the built vault.db is compared against the golden copy by Merkle release_hash equality, not by a raw SQLite byte-diff (SQLite output is not byte-reproducible; see ARCHITECTURE.md §3.5).
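As an illustration of why hash equality is preferred over a byte diff, the sketch below hashes the logical content of two databases that were populated in different orders. It is a simplified stand-in, not the construction from ARCHITECTURE.md: the questions(id, content_hash) table, the leaf format, and the flat SHA-256 over sorted leaves are all assumptions made for the example.

```python
import hashlib
import sqlite3


def release_hash(conn):
    """Hash the logical content of the questions table, independent of row order."""
    rows = conn.execute("SELECT id, content_hash FROM questions").fetchall()
    digest = hashlib.sha256()
    # Sorting the leaves makes the result independent of insertion order
    # and of how SQLite happens to lay out pages on disk.
    for qid, content_hash in sorted(rows):
        digest.update(f"{qid}:{content_hash}\n".encode("utf-8"))
    return digest.hexdigest()


def make_db(rows):
    """Build an in-memory database with a hypothetical two-column schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE questions (id TEXT PRIMARY KEY, content_hash TEXT)")
    conn.executemany("INSERT INTO questions VALUES (?, ?)", rows)
    return conn


rows = [("q-001", "aa11"), ("q-002", "bb22"), ("q-003", "cc33")]
built = make_db(rows)
golden = make_db(list(reversed(rows)))  # same content, different insertion order

assert release_hash(built) == release_hash(golden)
```

A content change in any single row changes the hash, which is the property the golden comparison relies on.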
2.3 Schema-drift fixtures (vault-cli/tests/fixtures/drift/)
Deliberately-broken YAMLs asserting each validator class catches its error:
- missing-required-field.yaml → fast invariant #8 (required fields).
- unknown-topic.yaml → structural invariant #11 (taxonomy gate).
- broken-chain-ref.yaml → structural invariant #12.
- chain-position-hole.yaml → structural invariant #13 (contiguous [1..N]).
- wrong-schema-version.yaml → loader rejects.
- yaml-billion-laughs.yaml → YAML hardening (invariant #9).
- yaml-oversized.yaml (>256KB) → file-size cap.
- html-in-scenario.yaml → content-format invariant #10.
- http-deep-dive-url.yaml → URL scheme allowlist.
- case-mismatched-path.yaml → path lowercase invariant #4.
- unknown-path-component.yaml → path enum invariant #5.
- duplicate-id.yaml → fast invariant #2 (unique IDs).
- id-path-mismatch.yaml → registry/file consistency check.
- provenance-llm-without-meta.yaml → invariant #18.
- exemplar-unreviewed-llm.yaml → invariant #19.
Each drift fixture has an associated pytest asserting the exact error class raised.
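A minimal sketch of the fixture-to-error-class pairing might look like the following. The exception names, the required-field set, and the validate() function are hypothetical placeholders rather than vault-cli's real API; only the idea of asserting the exact class comes from this plan.

```python
class InvariantError(Exception):
    """Base class for validator failures (hypothetical name)."""


class MissingRequiredField(InvariantError):
    pass


class DuplicateId(InvariantError):
    pass


REQUIRED_FIELDS = ("id", "title", "scenario")  # hypothetical field set


def validate(questions):
    """Raise the first invariant failure found in a list of question dicts."""
    seen = set()
    for question in questions:
        for field in REQUIRED_FIELDS:
            if field not in question:
                raise MissingRequiredField(field)
        if question["id"] in seen:
            raise DuplicateId(question["id"])
        seen.add(question["id"])


def raised_class(questions):
    """Return the exact exception class raised by validate(), or None if it passes."""
    try:
        validate(questions)
    except InvariantError as exc:
        return type(exc)
    return None


GOOD = {"id": "q-001", "title": "t", "scenario": "s"}

# (fixture name, parsed fixture content, expected error class)
DRIFT_CASES = [
    ("missing-required-field", [{"id": "q-001", "title": "t"}], MissingRequiredField),
    ("duplicate-id", [GOOD, dict(GOOD)], DuplicateId),
]

for name, questions, expected in DRIFT_CASES:
    # Identity check on the class, so a subclass or sibling error would not pass.
    assert raised_class(questions) is expected, name
```

In a pytest suite the same table would typically feed pytest.mark.parametrize, with pytest.raises wrapping the validator call.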
2.4 Cross-release fixtures (vault-cli/tests/fixtures/releases/)
- release-1.0.0-to-1.0.1/: content-only delta (modified scenarios, same schema) → exercises migrations emit and rollback symmetry.
- release-1.0.1-to-1.1.0/: schema-changing delta (new optional field) → exercises --schema-change gate and forward-only migration.
3. Test inventory per layer
3.1 Unit tests (vault-cli/tests/unit/)
Scope: pure functions, no I/O.
- test_id_allocation.py: content-hash derivation, 6-hex prefix stability, dedup-seq logic, collision detection.
- test_canonical_hash.py: key-order invariance (nested dicts hash identically with reordered keys), Unicode NFC normalization, LF normalization, excluded-fields whitelist correctness.
- test_merkle.py: release_hash leaf construction, __policy__ + __canon_version__ inclusion, ordering stability.
- test_path_parser.py: lowercase enforcement, enum validation, extract-classification-from-path round-trip.
- test_chain_form.py: structured {id, position} accepted, legacy <id>@<pos> accepted on read (not write).
- test_provenance_enum.py: closed enum rejection of unknown values; generation_meta required iff not human.
- test_content_format.py: plaintext/markdown/URL rules per field; allowlist rendering.
- test_yaml_hardening.py: billion-laughs rejected, size cap, depth cap, alias rejection, timeout.
- test_exit_codes.py: each exit code returned by the correct failure mode.
- test_policy_predicate.py: release-policy.yaml filter is the single source; import-graph check enforced.
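The properties listed for test_canonical_hash.py can be sketched with the standard library alone. This is one possible canonicalization under stated assumptions: the EXCLUDED_FIELDS set, the JSON serialization, and the choice of SHA-256 are illustrative and may not match the canonicalization version used by vault-cli.

```python
import hashlib
import json
import unicodedata

EXCLUDED_FIELDS = {"updated_at"}  # hypothetical: fields that never affect the hash


def _normalize(value):
    """Recursively normalize strings and drop excluded keys at every nesting level."""
    if isinstance(value, str):
        # LF normalization first, then Unicode NFC.
        value = value.replace("\r\n", "\n").replace("\r", "\n")
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items() if k not in EXCLUDED_FIELDS}
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    return value


def canonical_hash(question):
    """Hash a question dict so that key order, line endings, and Unicode form do not matter."""
    payload = json.dumps(
        _normalize(question),
        sort_keys=True,  # key-order invariance, including nested dicts
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = {"title": "Caf\u00e9", "meta": {"x": 1, "y": 2}, "body": "line1\nline2"}
b = {
    "body": "line1\r\nline2",  # CRLF instead of LF
    "meta": {"y": 2, "x": 1},  # reordered nested keys
    "title": "Cafe\u0301",  # decomposed form of the same text
    "updated_at": "ignored",  # excluded field
}
assert canonical_hash(a) == canonical_hash(b)
```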
3.2 Integration tests (vault-cli/tests/integration/)
Scope: multi-component, hits filesystem/SQLite but no network.
- test_build_equivalence.py: YAML → vault.db produces matching release_hash vs golden.
- test_validator_drift.py: each drift fixture raises the expected invariant-failure class.
- test_registry_append_only.py: a commit that removes a registry line is rejected by vault check.
- test_publish_atomicity.py: simulated mid-publish failure (step 5 kill) leaves .pending-<v>/, vault publish --resume completes cleanly.
- test_migrations_round_trip.py: forward-then-rollback on SQL path produces byte-identical pre-state (via content-hash, not binary diff).
- test_migrations_snapshot.py: pre-deploy-snapshot → forward migration → snapshot-restore → vault verify passes.
- test_chain_integrity.py: vault rm --hard on chained question refuses; vault move of single member of chain refuses.
- test_rename_on_move.py: vault move preserves filename (topic+hash+seq); git follows rename.
- test_renumber_recovery.py: simulated post-rebase collision → vault renumber produces valid state.
- test_scenario_dedup_lsh.py: MinHash buckets detect near-duplicates within LSH; embedding check catches templated-but-distinct.
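The append-only property checked by test_registry_append_only.py reduces to a prefix comparison. The sketch below assumes the registry can be read as an ordered list of lines; the line format shown is invented for the example, and how the two versions are obtained (for instance from the base branch and the working tree) is left out.

```python
def is_append_only(old_lines, new_lines):
    """True when every old line survives unchanged, in order, with additions only at the end."""
    return new_lines[: len(old_lines)] == old_lines


def first_violation(old_lines, new_lines):
    """Return the 1-based line number of the first removed or edited line, or None."""
    for index, old in enumerate(old_lines):
        if index >= len(new_lines) or new_lines[index] != old:
            return index + 1
    return None


base = ["q-001 a1b2c3", "q-002 d4e5f6"]  # hypothetical registry line format

appended = base + ["q-003 0a0b0c"]
removed = ["q-001 a1b2c3"]
edited = ["q-001 a1b2c3", "q-002 ffffff", "q-003 0a0b0c"]

assert is_append_only(base, appended)
assert not is_append_only(base, removed)
assert first_violation(base, edited) == 2
```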
3.3 CLI contract tests (vault-cli/tests/cli/)
Scope: end-to-end Typer CliRunner invocations.
- test_vault_new.py: new question creates file at correct path, ID allocated, YAML validates. --count N --batch opens N drafts in $EDITOR. Validation failure injects error comment + stderr.
- test_vault_edit.py: edit persists changes, re-validates, exits 1 on invalid save.
- test_vault_rm.py: default soft-delete sets status: deprecated; --hard without typed confirmation refuses; typed confirmation hard-deletes.
- test_vault_move.py: reclassify via git mv; refuses on dirty tree; refuses chain-breaking move.
- test_vault_restore.py: deprecated → published round-trip.
- test_vault_build.py: builds vault.db with correct release_hash; --local-json emits corpus.json.
- test_vault_check.py: fast-tier <1s, structural-tier <30s on fixture; --json emits LSP-diagnostic shapes.
- test_vault_publish.py: composed product equivalent to primitive sequence; .ship-journal.json absent on pre-ship state.
- test_vault_ship_journal.py: simulated sub-step failure populates journal; --resume continues; paper-leg failure pages (mock alerting).
- test_vault_deploy.py: pre-deploy R2 snapshot synchronous; POP propagation probe blocks on unverified stale.
- test_vault_rollback.py: snapshot method restores to verified state; sql method works on content-only release.
- test_vault_verify.py: exit 0 on match, exit 1 on any divergence (id-registry, content-hash, release_hash).
- test_vault_generate.py: exemplar-pool enforcement (refuses <3 eligible); dry-run emits cost estimate without API call; cap --count ≤25; secrets file mode 0600 enforced; daily ledger refuses over-ceiling.
- test_vault_api.py: local shim mirrors Worker endpoint surface; schemas match shared-types codegen.
- test_vault_doctor.py: each subcheck runnable independently; --json emits stable schema.
- test_exit_code_taxonomy.py: each failure mode returns the correct exit code per §4.6.
3.4 Data-migration tests (one-time, Phase 1 exit gate)
- test_corpus_json_split.py: current corpus.json → per-question YAML produces set-equal IDs.
- test_content_hash_parity.py: per-question content_hash from YAML matches hash recomputed from original corpus.json fields (post-whitelist + canonicalization).
- test_chain_graph_isomorphism.py: chain structure in vault.db matches chains.json exactly.
- test_policy_filter_parity.py: {published, validated} predicate produces the same set as current paper's analyze_corpus.py AND the same set as site's filter predicate; the counts converge (no more 9,199 vs 8,053).
3.5 Shared-types drift tests
- test_codegen_drift.py: re-run LinkML codegen in tempdir, diff against committed @staffml/vault-types/, vault-cli/src/vault_cli/models/, releases/<latest>/schema.sql. Any diff → fail PR. CI never auto-fixes.
3.6 Worker contract tests (staffml-vault-worker/tests/)
Scope: Worker behavior under miniflare (Cloudflare Workers local dev).
- test_endpoint_shapes.py: every endpoint's JSON matches @staffml/vault-types schema.
- test_cursor_pagination.py: opaque cursors; stable order across calls; invalid cursor rejected.
- test_etag.py: ETag format "<release_id>:<resource>:<content_hash>"; 304 on If-None-Match.
- test_cache_release_key.py: new release_id in release_metadata invalidates all Cache API entries.
- test_schema_fingerprint.py: cold-start hashes sqlite_master; mismatch → X-Vault-Degraded header + Cache-API-only mode, not 5xx.
- test_grace_window.py: during 10-min post-deploy window, worker accepts both current and previous release_id.
- test_rate_limit.py: 60 req/min/IP on GETs, 10 req/min on /search.
- test_cors.py: production domains allowed, wildcard not.
- test_no_mutations.py: no public mutation endpoints; POST/PUT/DELETE to any non-admin path → 405.
- test_admin_removed.py: POST /admin/release returns 404.
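The ETag behavior that test_etag.py exercises can be modeled in a few lines. The Worker itself is not written in Python; this sketch only illustrates the logic, and the function names and the simplified header parsing (comma split, weak-prefix stripping) are assumptions for the example.

```python
from typing import Optional, Tuple


def make_etag(release_id: str, resource: str, content_hash: str) -> str:
    """Build a quoted ETag in the "<release_id>:<resource>:<content_hash>" shape."""
    return f'"{release_id}:{resource}:{content_hash}"'


def respond(if_none_match: Optional[str], etag: str, body: bytes) -> Tuple[int, bytes]:
    """Return (status, body): 304 with an empty body when the client's tag matches."""
    if if_none_match is not None:
        # If-None-Match may carry several tags and uses weak comparison,
        # so a leading W/ is ignored. Splitting on commas is a simplification.
        candidates = [tag.strip().removeprefix("W/") for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, b""
    return 200, body


etag = make_etag("1.0.1", "questions/q-001", "abc123")
assert etag == '"1.0.1:questions/q-001:abc123"'
assert respond(etag, etag, b"{}") == (304, b"")
assert respond('"1.0.0:questions/q-001:abc123"', etag, b"{}") == (200, b"{}")
```

Because the release_id is part of the tag, a tag cached under a previous release no longer matches after a new release, even when the question content is unchanged.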
3.7 End-to-end tests (Playwright, Phase 4)
Against staging site pointing at staging D1. Full flows:
- e2e_practice.spec.ts: load → filter by track → reveal → navigate chain → AskInterviewer tutor.
- e2e_gauntlet.spec.ts: start session → answer N → view post-mortem.
- e2e_progress.spec.ts: attempts persist; due count correct.
- e2e_landing.spec.ts: question count matches manifest; no 19MB JS chunk in network tab.
- e2e_about.spec.ts: "Read the paper" visible above fold; BibTeX rendered; release_hash in footer.
- e2e_search.spec.ts: ⌘K opens palette; debounce; results ranked; snippet highlights.
- e2e_offline.spec.ts: service worker serves last 200 questions after DevTools offline toggle.
- e2e_fallback.spec.ts: NEXT_PUBLIC_VAULT_FALLBACK=static serves corpus.json fallback.
- e2e_rollback_drill.spec.ts: live site → flip fallback flag → site keeps serving (rehearsed pre-production).
3.8 Smoke tests (vault smoke-test, pre-every-deploy)
50 random question IDs per run: vault smoke-test --env <env> --samples 50.
For each ID:
- Worker response: GET /questions/:id.
- Direct D1 query: SELECT * FROM questions WHERE id=?.
- Direct vault.db query from releases/<v>/vault.db.
Assert: all three return byte-identical JSON. Any divergence fails the smoke-test; deploy aborts.
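The three-way comparison could be sketched as below. The fetcher callables stand in for the Worker request and the two database queries and are hypothetical; the sketch also serializes each record with sorted keys before comparing, which is an assumption, since the real command may compare raw response bytes.

```python
import json
import random


def canonical_bytes(record):
    """Serialize a record deterministically so equal content yields equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def smoke_test(ids, sources, samples=50, seed=None):
    """Sample IDs and return those for which the sources disagree.

    sources maps a source name to a callable taking an ID and returning a dict.
    """
    rng = random.Random(seed)
    ids = list(ids)
    sample = rng.sample(ids, min(samples, len(ids)))
    divergent = []
    for qid in sample:
        payloads = {canonical_bytes(fetch(qid)) for fetch in sources.values()}
        if len(payloads) != 1:  # more than one distinct payload means divergence
            divergent.append(qid)
    return divergent


corpus = {f"q-{i:03d}": {"id": f"q-{i:03d}", "title": f"Question {i}"} for i in range(100)}

healthy = {
    "worker": lambda qid: corpus[qid],
    "d1": lambda qid: dict(corpus[qid]),
    "vault_db": lambda qid: dict(corpus[qid]),
}
assert smoke_test(corpus, healthy, samples=50, seed=1) == []
```

A non-empty return value is the condition under which the deploy would abort.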
Additional checks:
- release_id consistency across /manifest, /questions/:id, and release_metadata.
- Cache API hit rate > 80% on repeat call (warm).
3.9 Load tests (k6, Phase 4 pre-cutover + pre-scale-milestones)
Scenarios against staging with production-scale corpus:
- Steady-state: 100 req/s sustained for 10 min.
- Burst: 500 req/s for 30s.
- Search-heavy: 20% of traffic is /search with realistic query mix (60% term, 20% phrase, 20% multi-term).
- Cold-path injection: 10% of requests hit non-warm POPs (via header manipulation).
Gates (ARCHITECTURE.md §10.6):
- p99 warm ≤ 100ms on /search.
- p99 cold ≤ 500ms.
- D1 row-reads/FTS5-query ≤ 500.
- 5xx rate = 0%.
- Cost projection ≤ $30/mo at 1K DAU scale.
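As a rough illustration, the latency, row-read, and error-rate gates can be evaluated from raw samples like this. The input shapes are assumptions (k6 exports its own summary format), the nearest-rank percentile is one of several common definitions, and the cost gate is left out because it depends on pricing inputs not covered here.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_gates(warm_ms, cold_ms, row_reads, status_codes):
    """Return a gate-name → passed mapping for the thresholds listed above."""
    return {
        "p99_warm_le_100ms": percentile(warm_ms, 99) <= 100,
        "p99_cold_le_500ms": percentile(cold_ms, 99) <= 500,
        "row_reads_le_500": max(row_reads) <= 500,  # per-query maximum (assumed reading)
        "zero_5xx": not any(500 <= code <= 599 for code in status_codes),
    }


result = evaluate_gates(
    warm_ms=[20] * 99 + [150],  # a single slow outlier in 100 samples
    cold_ms=[300] * 100,
    row_reads=[120, 480, 35],
    status_codes=[200] * 99 + [304],
)
assert all(result.values())
```

With 100 samples the nearest-rank p99 is the 99th smallest value, so the single 150 ms outlier does not trip the warm gate; a second outlier would.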
3.10 Rollback tests
- Property test on every publish: for every release N, apply(d1-migration.sql) + apply(d1-rollback.sql) produces state whose release_hash matches pre-migration state. Runs in CI.
- Snapshot-restore drill, quarterly: restore staging D1 from an R2 snapshot end-to-end. Timed. Target RTO ≤ 10 min decision-to-restored.
- Feature-flag rollback drill, pre-Phase-4: staging site with active service worker; flip NEXT_PUBLIC_VAULT_FALLBACK=static; redeploy; verify SW evicts stale cache; site serves from inlined corpus.
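The rollback-symmetry property from the first bullet can be demonstrated against an in-memory SQLite database. The schema, the two SQL strings, and the state_hash helper are invented for the example; the real test would apply the generated d1-migration.sql and d1-rollback.sql files and compare release_hash values.

```python
import hashlib
import sqlite3


def state_hash(conn):
    """Hash table content in a fixed order, standing in for release_hash."""
    digest = hashlib.sha256()
    for row in conn.execute("SELECT id, scenario FROM questions ORDER BY id"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def rollback_is_symmetric(conn, migration_sql, rollback_sql):
    """Apply forward then rollback SQL and report whether the state hash is restored."""
    before = state_hash(conn)
    conn.executescript(migration_sql)
    conn.executescript(rollback_sql)
    return state_hash(conn) == before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (id TEXT PRIMARY KEY, scenario TEXT)")
conn.executemany(
    "INSERT INTO questions VALUES (?, ?)",
    [("q-001", "v1 text"), ("q-002", "unchanged")],
)

# Hypothetical content-only delta and its inverse.
MIGRATION_SQL = "UPDATE questions SET scenario = 'v2 text' WHERE id = 'q-001';"
ROLLBACK_SQL = "UPDATE questions SET scenario = 'v1 text' WHERE id = 'q-001';"

assert rollback_is_symmetric(conn, MIGRATION_SQL, ROLLBACK_SQL)
```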
3.11 Data-plane SLI probes (continuous, post-Phase-3)
Worker crons per ARCHITECTURE.md §10.5 table. Alert on any divergence.
- 5-min: row-count parity vault.db vs D1.
- Hourly: 20-sample content_hash match.
- Hourly: FTS5 row count vs base table.
- Daily: provenance distribution, staleness, validation-failure rate on main.
- On deploy + hourly: /manifest release_id propagation across 8 POPs.
4. CI workflow spec
4.1 .github/workflows/staffml-validate-vault.yml (every PR touching interviews/vault/ or interviews/vault-cli/)
    name: '🎯 StaffML · ✅ Validate (Vault)'
    on:
      pull_request:
        paths:
          - 'interviews/vault/**'
          - 'interviews/vault-cli/**'
          - 'interviews/staffml-vault-worker/**'
          - 'interviews/staffml/package.json'
    jobs:
      validate:
        runs-on: ubuntu-latest  # upgrade to larger-runner if budget exceeded
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with: { python-version: '3.12' }  # PINNED for hash stability
          - run: pip install -e interviews/vault-cli/[dev]
          - run: vault --version
          # Fast + structural invariants (<60s budget)
          - run: vault check --strict
            timeout-minutes: 2
          # Build + Merkle equivalence (<30s)
          - run: vault build
          - name: Compare release_hash vs corpus-equivalence-hash.txt
            run: |
              EXPECTED=$(cat interviews/vault/corpus-equivalence-hash.txt)
              ACTUAL=$(vault stats --format json | jq -r .release_hash)
              [ "$EXPECTED" = "$ACTUAL" ] || exit 1
          # Shared-types codegen drift check
          - run: vault codegen --check
          # Registry append-only check
          - run: |
              git fetch origin main
              python interviews/vault-cli/scripts/check_registry_append_only.py
          # Unit + integration + CLI contract tests
          - run: pytest interviews/vault-cli/tests/ -x --timeout=60
          # Rollback symmetry (if this is a release commit on main)
          # NOTE: this condition is never true under the pull_request-only trigger
          # above; the step runs only once a push/tag trigger is added to `on:`.
          - name: Rollback symmetry property test
            if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
            run: pytest interviews/vault-cli/tests/release/ --rollback-symmetry
Status check required for PR merge.
4.2 .github/workflows/vault-nightly.yml
    name: vault-nightly
    on:
      schedule:
        - cron: '0 8 * * *'  # 08:00 UTC
    jobs:
      slow-checks:
        runs-on: ubuntu-latest
        steps:
          - run: vault check --tier slow  # link rot, LLM math, Pint units
          - run: vault doctor --check-links  # updates vault/link-rot.yaml
          - run: vault stats --format prometheus > /tmp/metrics.prom
          - run: curl -X POST "$PROMETHEUS_PUSHGATEWAY" --data-binary @/tmp/metrics.prom
      data-plane-sli-sweep:
        runs-on: ubuntu-latest
        steps:
          - run: vault smoke-test --env production --samples 100
4.3 .github/workflows/vault-worker-deploy.yml (invoked by vault ship)
    on:
      workflow_dispatch:
        inputs: { version: {required: true}, env: {required: true} }
    jobs:
      deploy:
        runs-on: ubuntu-latest
        environment: ${{ inputs.env }}  # gates production behind env-protection review
        steps:
          - run: vault verify ${{ inputs.version }}
          - run: vault deploy ${{ inputs.version }} --env ${{ inputs.env }}
          - run: vault smoke-test --env ${{ inputs.env }} --samples 50
5. Phase-entry gates
| Phase | Entry gate | Testing artifacts required |
|---|---|---|
| Phase 0 | nothing | — |
| Phase 1 | Phase 0 deliverables green | EVOLUTION.md committed; vault-cli/README.md; vault-cli skeleton + pytest green |
| Phase 2 | Phase 1 milestone: vault build release_hash matches equivalence-hash | Data-migration tests green; drift fixtures cover all invariant classes |
| Phase 3 | License decision resolved (L-10) | Content-hash parity test green; policy-filter-parity test green; rollback-symmetry test green on test corpus |
| Phase 4 | FTS5 load-test gate: p99 warm ≤100ms, p99 cold ≤500ms, row-reads ≤500/query | k6 load test artifacts; worker contract tests green; smoke-test infrastructure deployed to staging |
| Phase 4 cutover | Lighthouse CI green on practice/gauntlet/landing; E2E suite green; rollback drill executed in staging | Lighthouse artifacts; Playwright reports; rollback-drill recording |
| Phase 5 | Phase 4 cutover stable 48h | Chain-badge instrumentation emitting events; SLI dashboards green |
| Phase 6 | Phase 5 shipped | About-page e2e test green |
6. Observability + rollback during Phase 4 rollout
See ARCHITECTURE.md §19.5 + §10.5. Summary:
- Transport: 5xx rate alert (>1% over 5 min), p99 latency alert (>500ms sustained), request anomaly.
- Data-plane: row-count parity, content-hash sampling, FTS5 parity, schema_fingerprint parity, manifest-release_id-propagation across POPs. Any divergence → red.
- Canary: vault ship --canary-percent 10 → 50 → 100, soak = max(15 min, ≥100 sessions observed).
- Rollback: NEXT_PUBLIC_VAULT_FALLBACK=static redeploy. Rehearsed on staging pre-cutover.
- Dashboards: Cloudflare Analytics + Grafana (scraped from vault stats --format prometheus).
- Alerting: PagerDuty on any red SLI; Slack on yellow.
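One way to read the canary soak rule is that a step may advance only after both conditions hold: at least 15 minutes elapsed and at least 100 sessions observed. The sketch below encodes that reading; the function names are hypothetical and the actual vault ship logic may differ, for example by also gating on SLI status.

```python
CANARY_STEPS = (10, 50, 100)  # percentages from the rollout plan
MIN_SOAK_MINUTES = 15
MIN_SESSIONS = 100


def soak_complete(elapsed_minutes, sessions_observed):
    """The soak ends at the later of the time threshold and the session threshold."""
    return elapsed_minutes >= MIN_SOAK_MINUTES and sessions_observed >= MIN_SESSIONS


def next_canary_percent(current, elapsed_minutes, sessions_observed):
    """Return the percentage to run next: unchanged until the soak completes."""
    if not soak_complete(elapsed_minutes, sessions_observed):
        return current
    later = [step for step in CANARY_STEPS if step > current]
    return later[0] if later else current


assert next_canary_percent(10, elapsed_minutes=20, sessions_observed=40) == 10
assert next_canary_percent(10, elapsed_minutes=20, sessions_observed=120) == 50
assert next_canary_percent(100, elapsed_minutes=60, sessions_observed=900) == 100
```

On a low-traffic day the session threshold dominates, so a step can soak well past 15 minutes before advancing.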
7. Living document
TESTING.md evolves with the CLI. Each new vault subcommand adds a test_vault_<cmd>.py file. Each new invariant in ARCHITECTURE.md §5 adds a drift fixture. PRs that add functionality without tests are CI-blocked by coverage minimum (initially 80%, ratcheted upward).
End of testing plan.