cs249r_book/interviews/vault/TESTING.md
Vijay Janapa Reddi 2b381bb949 refactor(vault-cli): rename --legacy-json to --local-json
The flag is the StaffML frontend's local-dev fallback (read corpus.json
from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path.
"Legacy" implied "soon to be removed"; "local-json" describes its actual
role and reads correctly in scripts and docs.

- vault-cli: rename CLI flag, parameter, result key, and help text.
- CI workflows + pre-commit config: invoke the new flag name.
- All scripts that print the command (suggest_exemplars,
  pre_commit_corpus_guard, promote_validated, rename_legacy_ids,
  export_to_staffml, the paper analyze_corpus/generate_*) updated.
- Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING,
  MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend
  comments and .env.example / .gitignore) updated.

The "legacy_json" sentinel string in corpus_stats.json._meta.source
is intentionally NOT renamed — it is a stable artifact format read
by downstream paper-generation tooling.
2026-04-30 09:30:28 -04:00


StaffML Vault — Testing Plan

Scope: Full test strategy for vault-cli, the Worker API, and the cutover flow. Expands ARCHITECTURE.md §19 with concrete test inventory, fixtures, CI workflow spec, and phase gates.

Status: v1, drafted 2026-04-15 alongside ARCHITECTURE.md v2.1.


1. Test pyramid overview

| Layer | Runtime | Runs on | Blocks | Budget |
|---|---|---|---|---|
| Unit | pytest | every commit | PR merge | ≤5s |
| Integration | pytest | every commit | PR merge | ≤30s |
| Contract (CLI end-to-end) | pytest via Typer CliRunner | every commit | phase transition | ≤20s |
| Data-migration (corpus.json → YAML → vault.db) | pytest | once at Phase 1 exit | Phase 2 start | n/a |
| Equivalence (Merkle hash, site/paper parity) | CI job | every PR, Phase 1–3 | PR merge | ≤120s |
| Shared-types drift | CI job | every PR post-Phase-1 | PR merge | ≤30s |
| Worker contract | pytest + miniflare | every commit, Phase 3+ | phase transition | ≤60s |
| End-to-end (real site + staging D1) | Playwright | pre-cutover, Phase 4 | production cutover | ≤10min |
| Smoke (50 IDs: worker vs direct DB) | vault smoke-test | pre-every-prod-deploy | production deploy | ≤60s |
| Load (worker under realistic traffic) | k6 | pre-cutover, Phase 4 | production cutover | ≤30min |
| Rollback (symmetry, drill) | pytest + manual | every publish + quarterly | every deploy | ≤15min |
| Data-plane SLI | Worker cron | continuously post-Phase-3 | n/a (alerts) | ≤5s/run |

CI wall-clock budget on a standard GitHub runner: ≤2 min for all "every PR" checks combined.


2. Test fixtures

2.1 Frozen 20-question test corpus (vault-cli/tests/fixtures/test-corpus/)

The fixture's structure mirrors production vault/questions/ exactly. It changes only through an explicit, documented fixture bump.

Coverage requirements:

  • 20 questions drawn from 5 tracks × 6 levels × a subset of zones, chosen to hit every major (track, level, zone) combination.
  • 3 chained sequences: one 2-deep, one 3-deep, one 4-deep — exercises chain position invariants.
  • 1 deprecated question: exercises status-transition invariants.
  • 1 LLM-draft + 1 llm-then-human-edited: exercises provenance invariants.
  • 1 question in an excluded applicability cell: a negative fixture, included deliberately to assert that the validator rejects it.

Fixture includes:

  • questions/*.yaml (20 files).
  • taxonomy.yaml, chains.yaml, zones.yaml (frozen subsets referenced by the fixture).
  • release-policy.yaml (frozen).
  • id-registry.yaml (frozen append-only log).
  • release-1.0.0/ (expected snapshot artifacts for comparison).

2.2 Golden vault.db (vault-cli/tests/fixtures/golden/vault.db)

Expected SQLite output from running vault build against the frozen test corpus. Regenerated by vault-cli/tests/regenerate_golden.py when the canonicalization version bumps (rare, explicit, reviewed).

Primary integration test: the built vault.db is compared against the golden file by Merkle release_hash equality, not by raw SQLite byte-diff, because SQLite output is not byte-reproducible (see ARCHITECTURE.md §3.5).
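
To illustrate why a hash-based comparison is stable where a byte-diff is not, here is a minimal sketch of a Merkle-style release_hash. It assumes the leaves are per-question content hashes plus the `__policy__` and `__canon_version__` leaves that test_merkle.py covers. The function names, the leaf encoding, and the odd-level padding rule are illustrative assumptions, not vault-cli's actual scheme.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def release_hash(content_hashes: dict[str, str], policy_hash: str, canon_version: str) -> str:
    """Merkle root over sorted (key, value) leaves. Hypothetical encoding."""
    leaves = dict(content_hashes)
    leaves["__policy__"] = policy_hash
    leaves["__canon_version__"] = canon_version
    # Sorting by key makes the root independent of insertion or row order.
    level = [_h(f"{key}\x00{value}".encode("utf-8")) for key, value in sorted(leaves.items())]
    while len(level) > 1:
        if len(level) % 2:  # assumption: duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


# Same content in a different order yields the same root.
golden = release_hash({"q-a": "11", "q-b": "22"}, "p1", "1")
rebuilt = release_hash({"q-b": "22", "q-a": "11"}, "p1", "1")
assert golden == rebuilt
```

Because the policy and canonicalization version are leaves, changing either one changes the root even when no question content changed.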

2.3 Schema-drift fixtures (vault-cli/tests/fixtures/drift/)

Deliberately-broken YAMLs asserting each validator class catches its error:

  • missing-required-field.yaml → fast invariant #8 (required fields).
  • unknown-topic.yaml → structural invariant #11 (taxonomy gate).
  • broken-chain-ref.yaml → structural invariant #12.
  • chain-position-hole.yaml → structural invariant #13 (contiguous [1..N]).
  • wrong-schema-version.yaml → loader rejects.
  • yaml-billion-laughs.yaml → YAML hardening (invariant #9).
  • yaml-oversized.yaml (>256KB) → file-size cap.
  • html-in-scenario.yaml → content-format invariant #10.
  • http-deep-dive-url.yaml → URL scheme allowlist.
  • case-mismatched-path.yaml → path lowercase invariant #4.
  • unknown-path-component.yaml → path enum invariant #5.
  • duplicate-id.yaml → fast invariant #2 (unique IDs).
  • id-path-mismatch.yaml → registry/file consistency check.
  • provenance-llm-without-meta.yaml → invariant #18.
  • exemplar-unreviewed-llm.yaml → invariant #19.

Each drift fixture has an associated pytest asserting the exact error class raised.
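
As a sketch of what "asserting the exact error class" can look like, the example below implements two of the listed checks (contiguous chain positions and unique IDs) and a helper that distinguishes an exact class match from a subclass match. All class and function names are hypothetical, and the real validators operate on parsed YAML rather than bare lists.

```python
class InvariantError(Exception):
    """Base class for validator failures (hypothetical name)."""


class ChainPositionHole(InvariantError):
    """Chain positions are not exactly 1..N."""


class DuplicateId(InvariantError):
    """Two questions share an ID."""


def check_chain_positions(positions: list[int]) -> None:
    expected = list(range(1, len(positions) + 1))
    if sorted(positions) != expected:
        raise ChainPositionHole(f"expected {expected}, got {sorted(positions)}")


def check_unique_ids(ids: list[str]) -> None:
    seen = set()
    for qid in ids:
        if qid in seen:
            raise DuplicateId(qid)
        seen.add(qid)


def raises_exactly(exc_type, fn, *args) -> bool:
    """True only if fn raises exc_type itself, not a subclass or another error."""
    try:
        fn(*args)
    except Exception as err:
        return type(err) is exc_type
    return False


assert raises_exactly(ChainPositionHole, check_chain_positions, [1, 2, 4])
assert raises_exactly(DuplicateId, check_unique_ids, ["q1", "q2", "q1"])
assert not raises_exactly(InvariantError, check_unique_ids, ["q1", "q1"])
```

With pytest, `pytest.raises(InvariantError)` would also accept any subclass, so a test that needs the exact class has to check `type(excinfo.value)` as well.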

2.4 Cross-release fixtures (vault-cli/tests/fixtures/releases/)

  • release-1.0.0-to-1.0.1/ — content-only delta (modified scenarios, same schema) → exercises migrations emit and rollback symmetry.
  • release-1.0.1-to-1.1.0/ — schema-changing delta (new optional field) → exercises --schema-change gate and forward-only migration.

3. Test inventory per layer

3.1 Unit tests (vault-cli/tests/unit/)

Scope: pure functions, no I/O.

  • test_id_allocation.py: content-hash derivation, 6-hex prefix stability, dedup-seq logic, collision detection.
  • test_canonical_hash.py: key-order invariance (nested dicts hash identically with reordered keys), Unicode NFC normalization, LF normalization, excluded-fields whitelist correctness.
  • test_merkle.py: release_hash leaf construction, __policy__ + __canon_version__ inclusion, ordering stability.
  • test_path_parser.py: lowercase enforcement, enum validation, extract-classification-from-path round-trip.
  • test_chain_form.py: structured {id, position} accepted, legacy <id>@<pos> accepted on read (not write).
  • test_provenance_enum.py: closed enum rejection of unknown values; generation_meta required iff not human.
  • test_content_format.py: plaintext/markdown/URL rules per field; allowlist rendering.
  • test_yaml_hardening.py: billion-laughs rejected, size cap, depth cap, alias rejection, timeout.
  • test_exit_codes.py: each exit code returned by the correct failure mode.
  • test_policy_predicate.py: release-policy.yaml filter is the single source; import-graph check enforced.
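
The properties listed for test_canonical_hash.py can be sketched as follows. This is a minimal illustration assuming SHA-256 over sorted-key JSON; the `EXCLUDED_FIELDS` contents and the choice to apply the exclusion at every nesting level are assumptions, not the real vault-cli canonicalization.

```python
import hashlib
import json
import unicodedata

EXCLUDED_FIELDS = {"updated_at"}  # hypothetical; the real list lives in vault-cli


def _normalize(value):
    if isinstance(value, str):
        # Normalize line endings to LF, then Unicode to NFC.
        text = value.replace("\r\n", "\n").replace("\r", "\n")
        return unicodedata.normalize("NFC", text)
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items() if k not in EXCLUDED_FIELDS}
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    return value


def canonical_hash(question: dict) -> str:
    payload = json.dumps(
        _normalize(question), sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Key order, CRLF vs LF, NFD vs NFC, and excluded fields do not affect the hash.
a = canonical_hash({"title": "caf\u00e9", "meta": {"x": 1, "y": 2}, "body": "a\nb"})
b = canonical_hash(
    {"body": "a\r\nb", "meta": {"y": 2, "x": 1}, "title": "cafe\u0301", "updated_at": "2026"}
)
assert a == b
```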

3.2 Integration tests (vault-cli/tests/integration/)

Scope: multi-component, hits filesystem/SQLite but no network.

  • test_build_equivalence.py: YAML → vault.db produces matching release_hash vs golden.
  • test_validator_drift.py: each drift fixture raises the expected invariant-failure class.
  • test_registry_append_only.py: a commit that removes a registry line is rejected by vault check.
  • test_publish_atomicity.py: simulated mid-publish failure (step 5 kill) leaves .pending-<v>/, vault publish --resume completes cleanly.
  • test_migrations_round_trip.py: forward-then-rollback on the SQL path restores the pre-migration state, verified by content-hash equality rather than a binary diff.
  • test_migrations_snapshot.py: pre-deploy-snapshot → forward migration → snapshot-restore → vault verify passes.
  • test_chain_integrity.py: vault rm --hard on chained question refuses; vault move of single member of chain refuses.
  • test_rename_on_move.py: vault move preserves filename (topic+hash+seq); git follows rename.
  • test_renumber_recovery.py: simulated post-rebase collision → vault renumber produces valid state.
  • test_scenario_dedup_lsh.py: MinHash buckets detect near-duplicates within LSH; embedding check catches templated-but-distinct.

3.3 CLI contract tests (vault-cli/tests/cli/)

Scope: end-to-end Typer CliRunner invocations.

  • test_vault_new.py: new question creates file at correct path, ID allocated, YAML validates. --count N --batch opens N drafts in $EDITOR. Validation failure injects error comment + stderr.
  • test_vault_edit.py: edit persists changes, re-validates, exits 1 on invalid save.
  • test_vault_rm.py: default soft-delete sets status: deprecated; --hard without typed confirmation refuses; typed confirmation hard-deletes.
  • test_vault_move.py: reclassify via git mv; refuses on dirty tree; refuses chain-breaking move.
  • test_vault_restore.py: deprecated → published round-trip.
  • test_vault_build.py: builds vault.db with correct release_hash; --local-json emits corpus.json.
  • test_vault_check.py: fast-tier <1s, structural-tier <30s on fixture; --json emits LSP-diagnostic shapes.
  • test_vault_publish.py: composed product equivalent to primitive sequence; .ship-journal.json absent on pre-ship state.
  • test_vault_ship_journal.py: simulated sub-step failure populates journal; --resume continues; paper-leg failure pages (mock alerting).
  • test_vault_deploy.py: pre-deploy R2 snapshot synchronous; POP propagation probe blocks on unverified stale.
  • test_vault_rollback.py: snapshot method restores to verified state; sql method works on content-only release.
  • test_vault_verify.py: exit 0 on match, exit 1 on any divergence (id-registry, content-hash, release_hash).
  • test_vault_generate.py: exemplar-pool enforcement (refuses <3 eligible); dry-run emits cost estimate without API call; cap --count ≤25; secrets file mode 0600 enforced; daily ledger refuses over-ceiling.
  • test_vault_api.py: local shim mirrors Worker endpoint surface; schemas match shared-types codegen.
  • test_vault_doctor.py: each subcheck runnable independently; --json emits stable schema.
  • test_exit_code_taxonomy.py: each failure mode returns the correct exit code per §4.6.

3.4 Data-migration tests (one-time, Phase 1 exit gate)

  • test_corpus_json_split.py: current corpus.json → per-question YAML produces set-equal IDs.
  • test_content_hash_parity.py: per-question content_hash from YAML matches hash recomputed from original corpus.json fields (post-whitelist + canonicalization).
  • test_chain_graph_isomorphism.py: chain structure in vault.db matches chains.json exactly.
  • test_policy_filter_parity.py: {published, validated} predicate produces the same set as current paper's analyze_corpus.py AND the same set as site's filter predicate — converged count (no more 9,199 vs 8,053).

3.5 Shared-types drift tests

  • test_codegen_drift.py: re-run LinkML codegen in tempdir, diff against committed @staffml/vault-types/, vault-cli/src/vault_cli/models/, releases/<latest>/schema.sql. Any diff → fail PR. CI never auto-fixes.
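
A minimal sketch of the diff step, assuming codegen has already been re-run into a temporary directory: hash every file in both trees and report any path that was added, removed, or changed. The function names are hypothetical, and the LinkML codegen invocation itself is not shown.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def codegen_drift(committed: Path, regenerated: Path) -> list[str]:
    """Return the sorted paths that differ between the two trees (empty means no drift)."""
    old, new = tree_digest(committed), tree_digest(regenerated)
    return sorted(path for path in old.keys() | new.keys() if old.get(path) != new.get(path))
```

A CI step built on this would fail the PR when the returned list is non-empty and print the paths, leaving the fix to the author rather than committing regenerated output.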

3.6 Worker contract tests (staffml-vault-worker/tests/)

Scope: Worker behavior under miniflare (Cloudflare Workers local dev).

  • test_endpoint_shapes.py: every endpoint's JSON matches @staffml/vault-types schema.
  • test_cursor_pagination.py: opaque cursors; stable order across calls; invalid cursor rejected.
  • test_etag.py: ETag format "<release_id>:<resource>:<content_hash>"; 304 on If-None-Match.
  • test_cache_release_key.py: new release_id in release_metadata invalidates all Cache API entries.
  • test_schema_fingerprint.py: cold-start hashes sqlite_master; mismatch → X-Vault-Degraded header + Cache-API-only mode, not 5xx.
  • test_grace_window.py: during 10-min post-deploy window, worker accepts both current and previous release_id.
  • test_rate_limit.py: 60 req/min/IP on GETs, 10 req/min on /search.
  • test_cors.py: production domains allowed, wildcard not.
  • test_no_mutations.py: no public mutation endpoints; POST/PUT/DELETE to any non-admin path → 405.
  • test_admin_removed.py: POST /admin/release returns 404.
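
The ETag contract from test_etag.py can be sketched in a few lines. The example assumes the surrounding double quotes are part of the header value, as HTTP entity-tags require, and uses hypothetical function names. It models only the comparison logic, not the Worker itself.

```python
from typing import Optional


def make_etag(release_id: str, resource: str, content_hash: str) -> str:
    return f'"{release_id}:{resource}:{content_hash}"'


def conditional_status(if_none_match: Optional[str], current_etag: str) -> int:
    """Return 304 when the client's validator matches the current ETag, else 200."""
    if if_none_match is None:
        return 200
    candidates = [tag.strip() for tag in if_none_match.split(",")]
    # If-None-Match uses weak comparison (RFC 9110), so a W/ prefix is ignored.
    candidates = [tag[2:] if tag.startswith("W/") else tag for tag in candidates]
    return 304 if "*" in candidates or current_etag in candidates else 200


etag = make_etag("1.0.0", "questions/abc123", "deadbeef")
assert conditional_status(etag, etag) == 304
assert conditional_status('"1.0.1:questions/abc123:deadbeef"', etag) == 200
```

Because release_id is part of the tag, a new release invalidates every client-held validator even when the content hash of a given question is unchanged.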

3.7 End-to-end tests (Playwright, Phase 4)

Against staging site pointing at staging D1. Full flows:

  • e2e_practice.spec.ts: load → filter by track → reveal → navigate chain → AskInterviewer tutor.
  • e2e_gauntlet.spec.ts: start session → answer N → view post-mortem.
  • e2e_progress.spec.ts: attempts persist; due count correct.
  • e2e_landing.spec.ts: question count matches manifest; no 19MB JS chunk in network tab.
  • e2e_about.spec.ts: "Read the paper" visible above fold; BibTeX rendered; release_hash in footer.
  • e2e_search.spec.ts: ⌘K opens palette; debounce; results ranked; snippet highlights.
  • e2e_offline.spec.ts: service worker serves last 200 questions after DevTools offline toggle.
  • e2e_fallback.spec.ts: NEXT_PUBLIC_VAULT_FALLBACK=static serves corpus.json fallback.
  • e2e_rollback_drill.spec.ts: live site → flip fallback flag → site keeps serving (rehearsed pre-production).

3.8 Smoke tests (vault smoke-test, pre-every-deploy)

50 random question IDs per run: vault smoke-test --env <env> --samples 50.

For each ID:

  • Worker response GET /questions/:id.
  • Direct D1 query SELECT * FROM questions WHERE id=?.
  • Direct vault.db query from releases/<v>/vault.db.

Assert: all three return byte-identical JSON. Any divergence fails the smoke-test; deploy aborts.
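
A minimal sketch of the three-way comparison follows. It assumes each source's result has already been fetched into a dict keyed by question ID, and it canonicalizes each record (sorted keys, compact separators) before comparing bytes so that key order alone is not reported as divergence. Whether vault smoke-test canonicalizes in this way is an assumption, and the function names are hypothetical.

```python
import json


def canonical_json(record: dict) -> bytes:
    return json.dumps(
        record, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")


def divergent_ids(worker: dict, d1: dict, vault_db: dict) -> list[str]:
    """Each argument maps question ID -> record as returned by that source."""
    bad = []
    for qid in sorted(worker.keys() | d1.keys() | vault_db.keys()):
        blobs = {
            canonical_json(source[qid]) if qid in source else None
            for source in (worker, d1, vault_db)
        }
        # Exactly one distinct blob, and no source missing the ID, means agreement.
        if len(blobs) != 1 or None in blobs:
            bad.append(qid)
    return bad


record = {"id": "q1", "title": "t"}
assert divergent_ids({"q1": record}, {"q1": record}, {"q1": record}) == []
assert divergent_ids({"q1": record}, {"q1": {"id": "q1", "title": "T"}}, {"q1": record}) == ["q1"]
```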

Additional checks:

  • release_id consistency across /manifest, /questions/:id, and release_metadata.
  • Cache API hit rate > 80% on repeat call (warm).

3.9 Load tests (k6, Phase 4 pre-cutover + pre-scale-milestones)

Scenarios against staging with production-scale corpus:

  • Steady-state: 100 req/s sustained for 10 min.
  • Burst: 500 req/s for 30s.
  • Search-heavy: 20% of traffic is /search with realistic query mix (60% term, 20% phrase, 20% multi-term).
  • Cold-path injection: 10% of requests hit non-warm POPs (via header manipulation).

Gates (ARCHITECTURE.md §10.6):

  • p99 warm ≤ 100ms on /search.
  • p99 cold ≤ 500ms.
  • D1 row-reads/FTS5-query ≤ 500.
  • 5xx rate = 0%.
  • Cost projection ≤ $30/mo at 1K DAU scale.
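
The gates above can be evaluated mechanically once a load run has produced raw numbers. The sketch below uses a nearest-rank p99 and hypothetical parameter names. How k6 output is parsed into these inputs, and how the cost projection is derived, are outside its scope.

```python
def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def failed_gates(warm_ms, cold_ms, max_row_reads, count_5xx, monthly_cost_usd) -> list[str]:
    """Return the names of the gates that did not pass (empty list means all green)."""
    checks = {
        "p99 warm <= 100ms": p99(warm_ms) <= 100,
        "p99 cold <= 500ms": p99(cold_ms) <= 500,
        "row-reads per FTS5 query <= 500": max_row_reads <= 500,
        "5xx rate = 0%": count_5xx == 0,
        "cost <= $30/mo": monthly_cost_usd <= 30,
    }
    return [name for name, ok in checks.items() if not ok]


# Two slow requests out of 100 are enough to push the nearest-rank p99 over the gate.
warm = [10.0] * 98 + [150.0, 150.0]
assert failed_gates(warm, [400.0] * 100, 480, 0, 25) == ["p99 warm <= 100ms"]
```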

3.10 Rollback tests

  • Property test on every publish: for every release N, apply(d1-migration.sql) + apply(d1-rollback.sql) produces state whose release_hash matches pre-migration state. Runs in CI.
  • Snapshot-restore drill, quarterly: restore staging D1 from an R2 snapshot end-to-end. Timed. Target RTO ≤ 10 min decision-to-restored.
  • Feature-flag rollback drill, pre-Phase-4: staging site with active service worker; flip NEXT_PUBLIC_VAULT_FALLBACK=static; redeploy; verify SW evicts stale cache; site serves from inlined corpus.
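
The rollback-symmetry property can be sketched against an in-memory SQLite database: hash the state, apply the forward script, apply the rollback script, and require the hash to match. Here `state_hash` is a simplified stand-in for release_hash, ordering rows by their first column (assumed to be the primary key), and the SQL strings stand in for d1-migration.sql and d1-rollback.sql.

```python
import hashlib
import sqlite3


def state_hash(conn: sqlite3.Connection) -> str:
    """Hash the schema plus all rows in a deterministic order."""
    digest = hashlib.sha256()
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for name, sql in tables:
        digest.update(f"{name}\x00{sql}\x00".encode("utf-8"))
        for row in conn.execute(f'SELECT * FROM "{name}" ORDER BY 1'):
            digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def is_symmetric(conn: sqlite3.Connection, forward_sql: str, rollback_sql: str) -> bool:
    before = state_hash(conn)
    conn.executescript(forward_sql)
    conn.executescript(rollback_sql)
    return state_hash(conn) == before


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE questions (id TEXT PRIMARY KEY, scenario TEXT);"
    "INSERT INTO questions VALUES ('q1', 'old');"
)
forward = "UPDATE questions SET scenario = 'new' WHERE id = 'q1';"
assert is_symmetric(conn, forward, "UPDATE questions SET scenario = 'old' WHERE id = 'q1';")
```

A rollback script that forgets one of the forward changes makes `is_symmetric` return False, which is the failure the CI property test is meant to catch before a publish.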

3.11 Data-plane SLI probes (continuous, post-Phase-3)

Worker crons per ARCHITECTURE.md §10.5 table. Alert on any divergence.

  • 5-min: row-count parity vault.db vs D1.
  • Hourly: 20-sample content_hash match.
  • Hourly: FTS5 row count vs base table.
  • Daily: provenance distribution, staleness, validation-failure rate on main.
  • On deploy + hourly: /manifest release_id propagation across 8 POPs.

4. CI workflow spec

4.1 .github/workflows/staffml-validate-vault.yml (every PR touching interviews/vault/, interviews/vault-cli/, interviews/staffml-vault-worker/, or interviews/staffml/package.json)

name: '🎯 StaffML · ✅ Validate (Vault)'
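on:
  pull_request:
    paths:
      - 'interviews/vault/**'
      - 'interviews/vault-cli/**'
      - 'interviews/staffml-vault-worker/**'
      - 'interviews/staffml/package.json'
  push:
    tags: ['v*']  # without a tag-push trigger, the tag-gated rollback-symmetry step can never run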

jobs:
  validate:
    runs-on: ubuntu-latest  # upgrade to larger-runner if budget exceeded
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }  # PINNED — hash stability
      - run: pip install -e 'interviews/vault-cli[dev]'  # quoted so the shell does not glob-expand [dev]
      - run: vault --version

      # Fast + structural invariants (<60s budget)
      - run: vault check --strict
        timeout-minutes: 2

      # Build + Merkle equivalence (<30s)
      - run: vault build
      - name: Compare release_hash vs corpus-equivalence-hash.txt
        run: |
          EXPECTED=$(cat interviews/vault/corpus-equivalence-hash.txt)
          ACTUAL=$(vault stats --format json | jq -r .release_hash)
          [ "$EXPECTED" = "$ACTUAL" ] || exit 1

      # Shared-types codegen drift check
      - run: vault codegen --check

      # Registry append-only check
      - run: |
          git fetch origin main
          python interviews/vault-cli/scripts/check_registry_append_only.py

      # Unit + integration + CLI contract tests
      - run: pytest interviews/vault-cli/tests/ -x --timeout=60

      # Rollback symmetry (if this is a release commit on main)
      - name: Rollback symmetry property test
        if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
        run: pytest interviews/vault-cli/tests/release/ --rollback-symmetry

Status check required for PR merge.

4.2 .github/workflows/vault-nightly.yml

name: vault-nightly
on:
  schedule:
    - cron: '0 8 * * *'  # 08:00 UTC

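jobs:
  slow-checks:
    runs-on: ubuntu-latest
    steps:
      # Check out the repo and install vault-cli so the `vault` command exists on the runner
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e 'interviews/vault-cli[dev]'
      - run: vault check --tier slow     # link rot, LLM math, Pint units
      - run: vault doctor --check-links  # updates vault/link-rot.yaml
      - run: vault stats --format prometheus > /tmp/metrics.prom
      # PROMETHEUS_PUSHGATEWAY must be supplied to the job (for example via env or secrets)
      - run: curl -X POST "$PROMETHEUS_PUSHGATEWAY" --data-binary @/tmp/metrics.prom

  data-plane-sli-sweep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e 'interviews/vault-cli[dev]'
      - run: vault smoke-test --env production --samples 100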

4.3 .github/workflows/vault-worker-deploy.yml (invoked by vault ship)

on:
  workflow_dispatch:
    inputs: { version: {required: true}, env: {required: true} }

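jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.env }}  # gates production behind env-protection review
    steps:
      # Check out the repo and install vault-cli so the `vault` command exists on the runner
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e 'interviews/vault-cli[dev]'
      - run: vault verify ${{ inputs.version }}
      - run: vault deploy ${{ inputs.version }} --env ${{ inputs.env }}
      - run: vault smoke-test --env ${{ inputs.env }} --samples 50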

5. Phase-entry gates

| Phase | Entry gate | Testing artifacts required |
|---|---|---|
| Phase 0 | nothing | |
| Phase 1 | Phase 0 deliverables green | EVOLUTION.md committed; vault-cli/README.md; vault-cli skeleton + pytest green |
| Phase 2 | Phase 1 milestone: vault build release_hash matches equivalence-hash | Data-migration tests green; drift fixtures cover all invariant classes |
| Phase 3 | License decision resolved (L-10) | Content-hash parity test green; policy-filter-parity test green; rollback-symmetry test green on test corpus |
| Phase 4 | FTS5 load-test gate: p99 warm ≤100ms, p99 cold ≤500ms, row-reads ≤500/query | k6 load test artifacts; worker contract tests green; smoke-test infrastructure deployed to staging |
| Phase 4 cutover | Lighthouse CI green on practice/gauntlet/landing; E2E suite green; rollback drill executed in staging | Lighthouse artifacts; Playwright reports; rollback-drill recording |
| Phase 5 | Phase 4 cutover stable 48h | Chain-badge instrumentation emitting events; SLI dashboards green |
| Phase 6 | Phase 5 shipped | About-page e2e test green |

6. Observability + rollback during Phase 4 rollout

See ARCHITECTURE.md §19.5 + §10.5. Summary:

  • Transport: 5xx rate alert (>1% over 5 min), p99 latency alert (>500ms sustained), request anomaly.
  • Data-plane: row-count parity, content-hash sampling, FTS5 parity, schema_fingerprint parity, manifest-release_id-propagation across POPs. Any divergence → red.
  • Canary: vault ship --canary-percent 10 → 50 → 100; each step soaks until both 15 min have elapsed and ≥100 sessions have been observed, whichever takes longer.
  • Rollback: NEXT_PUBLIC_VAULT_FALLBACK=static redeploy. Rehearsed on staging pre-cutover.
  • Dashboards: Cloudflare Analytics + Grafana (scraped from vault stats --format prometheus).
  • Alerting: PagerDuty on any red SLI; Slack on yellow.

7. Living document

TESTING.md evolves with the CLI. Each new vault subcommand adds a test_vault_<cmd>.py file. Each new invariant in ARCHITECTURE.md §5 adds a drift fixture. PRs that add functionality without tests are CI-blocked by coverage minimum (initially 80%, ratcheted upward).

End of testing plan.