mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-07 18:18:42 -05:00


Vijay Janapa Reddi 2b381bb949 refactor(vault-cli): rename --legacy-json to --local-json

The flag is the StaffML frontend's local-dev fallback (read corpus.json
from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path.
"Legacy" implied "soon to be removed"; "local-json" describes its actual
role and reads correctly in scripts and docs.

- vault-cli: rename CLI flag, parameter, result key, and help text.
- CI workflows + pre-commit config: invoke the new flag name.
- All scripts that print the command (suggest_exemplars,
  pre_commit_corpus_guard, promote_validated, rename_legacy_ids,
  export_to_staffml, the paper analyze_corpus/generate_*) updated.
- Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING,
  MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend
  comments and .env.example / .gitignore) updated.

The "legacy_json" sentinel string in corpus_stats.json._meta.source
is intentionally NOT renamed — it is a stable artifact format read
by downstream paper-generation tooling.

2026-04-30 09:30:28 -04:00


StaffML Vault — Testing Plan

Scope: Full test strategy for vault-cli, the Worker API, and the cutover flow. Expands ARCHITECTURE.md §19 with concrete test inventory, fixtures, CI workflow spec, and phase gates.

Status: v1, drafted 2026-04-15 alongside ARCHITECTURE.md v2.1.

1. Test pyramid overview

| Layer | Runtime | Runs on | Blocks | Budget |
|---|---|---|---|---|
| Unit | pytest | every commit | PR merge | ≤5s |
| Integration | pytest | every commit | PR merge | ≤30s |
| Contract (CLI end-to-end) | pytest via Typer CliRunner | every commit | phase transition | ≤20s |
| Data-migration (corpus.json → YAML → vault.db) | pytest | once at Phase 1 exit | Phase 2 start | n/a |
| Equivalence (Merkle hash, site/paper parity) | CI job | every PR Phase 1–3 | PR merge | ≤120s |
| Shared-types drift | CI job | every PR post-Phase-1 | PR merge | ≤30s |
| Worker contract | pytest + miniflare | every commit Phase 3+ | phase transition | ≤60s |
| End-to-end (real site + staging D1) | Playwright | pre-cutover Phase 4 | production cutover | ≤10min |
| Smoke (50 IDs: worker vs direct DB) | `vault smoke-test` | pre-every-prod-deploy | production deploy | ≤60s |
| Load (worker under realistic traffic) | k6 | pre-cutover Phase 4 | production cutover | ≤30min |
| Rollback (symmetry, drill) | pytest + manual | every publish + quarterly | every deploy | ≤15min |
| Data-plane SLI | Worker cron | continuously post-Phase-3 | n/a (alerts) | ≤5s/run |

CI wall-clock budget on a standard GitHub runner: ≤2 min for all "every PR" checks combined.

2. Test fixtures

2.1 Frozen 20-question test corpus (`vault-cli/tests/fixtures/test-corpus/`)

Structure mirrors production vault/questions/ exactly. The fixture changes only through an explicit, documented fixture bump.

Coverage requirements:

20 questions sampled across 5 tracks × 6 levels × a subset of zones, chosen so that every major (track, level, zone) combination is hit at least once.
3 chained sequences: one 2-deep, one 3-deep, one 4-deep — exercises chain position invariants.
1 deprecated question: exercises status-transition invariants.
1 LLM-draft + 1 llm-then-human-edited: exercises provenance invariants.
1 question in excluded applicability cell: deliberately included to assert the validator rejects it (negative fixture).
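
The chained-sequence fixtures exist to exercise the contiguity rule on chain positions (positions form exactly [1..N]). A minimal sketch of that check follows, assuming chain members are represented as `{id, position}` dicts. The function name and the sample IDs are hypothetical, not the vault-cli implementation.

```python
def chain_positions_contiguous(members: list[dict]) -> bool:
    """Return True when positions are exactly 1..N with no gaps or repeats."""
    positions = sorted(m["position"] for m in members)
    return positions == list(range(1, len(positions) + 1))


# Hypothetical 3-deep chain, listed out of order on purpose.
ok_chain = [
    {"id": "q-a", "position": 2},
    {"id": "q-b", "position": 1},
    {"id": "q-c", "position": 3},
]

# Position 2 is missing: the shape a chain-position-hole fixture would have.
holed_chain = [{"id": "q-a", "position": 1}, {"id": "q-b", "position": 3}]
```

Sorting before comparing keeps the check independent of the order in which members are listed in the file.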

Fixture includes:

questions/*.yaml (20 files).
taxonomy.yaml, chains.yaml, zones.yaml (frozen subsets referenced by the fixture).
release-policy.yaml (frozen).
id-registry.yaml (frozen append-only log).
release-1.0.0/ (expected snapshot artifacts for comparison).

2.2 Golden `vault.db` (`vault-cli/tests/fixtures/golden/vault.db`)

Expected SQLite output from running vault build against the frozen test corpus. Regenerated by vault-cli/tests/regenerate_golden.py when the canonicalization version bumps (rare, explicit, reviewed).

Primary integration test: the built vault.db is compared against the golden file by Merkle release_hash equality, not by a raw SQLite byte-diff (SQLite output is not byte-reproducible; see ARCHITECTURE.md §3.5).
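
To illustrate why a hash over logical content is stable where file bytes are not, here is a rough sketch of a Merkle-style root over per-question content hashes. The leaf encoding, the SHA-256 choice, and the odd-node duplication rule are assumptions made for illustration; the real construction is defined in ARCHITECTURE.md and may differ.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def release_hash(content_hashes: dict[str, str]) -> str:
    """Fold per-question content hashes into one root, independent of input order."""
    # Sorting by ID makes the leaf order deterministic.
    level = [_h(f"{qid}:{ch}".encode()) for qid, ch in sorted(content_hashes.items())]
    if not level:
        return _h(b"").hex()
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()
```

Extra leaves such as `__policy__` and `__canon_version__` could be passed as ordinary keys in the input dict, so that a change to either one changes the root.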

2.3 Schema-drift fixtures (`vault-cli/tests/fixtures/drift/`)

Deliberately-broken YAMLs asserting each validator class catches its error:

missing-required-field.yaml → fast invariant #8 (required fields).
unknown-topic.yaml → structural invariant #11 (taxonomy gate).
broken-chain-ref.yaml → structural invariant #12.
chain-position-hole.yaml → structural invariant #13 (contiguous [1..N]).
wrong-schema-version.yaml → loader rejects.
yaml-billion-laughs.yaml → YAML hardening (invariant #9).
yaml-oversized.yaml (>256KB) → file-size cap.
html-in-scenario.yaml → content-format invariant #10.
http-deep-dive-url.yaml → URL scheme allowlist.
case-mismatched-path.yaml → path lowercase invariant #4.
unknown-path-component.yaml → path enum invariant #5.
duplicate-id.yaml → fast invariant #2 (unique IDs).
id-path-mismatch.yaml → registry/file consistency check.
provenance-llm-without-meta.yaml → invariant #18.
exemplar-unreviewed-llm.yaml → invariant #19.

Each drift fixture has an associated pytest asserting the exact error class raised.
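
The pairing of fixtures with exact error classes can be sketched as a table-driven check. Everything below is a toy stand-in: the exception classes, the `validate` function, and the field names are hypothetical, and the real tests load the YAML fixtures and would normally be parametrized under pytest.

```python
class InvariantError(Exception):
    """Hypothetical base class for validator failures."""


class MissingRequiredField(InvariantError):
    pass


class UnknownTopic(InvariantError):
    pass


KNOWN_TOPICS = {"memory", "networking"}  # stand-in for taxonomy.yaml


def validate(question: dict) -> None:
    """Toy validator: raise the specific error class for the first failure."""
    for field in ("id", "topic", "scenario"):
        if field not in question:
            raise MissingRequiredField(field)
    if question["topic"] not in KNOWN_TOPICS:
        raise UnknownTopic(question["topic"])


# Each broken input is paired with the exact class it must raise.
DRIFT_CASES = [
    ({"id": "q1", "topic": "memory"}, MissingRequiredField),
    ({"id": "q2", "topic": "astrology", "scenario": "..."}, UnknownTopic),
]


def raised_class(question: dict):
    """Return the exception class raised by validate(), or None if it passes."""
    try:
        validate(question)
    except InvariantError as exc:
        return type(exc)
    return None
```

Comparing with `is` on `type(exc)` rather than `isinstance` is what makes the assertion "exact": a subclass or a sibling error would not satisfy it.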

2.4 Cross-release fixtures (`vault-cli/tests/fixtures/releases/`)

release-1.0.0-to-1.0.1/ — content-only delta (modified scenarios, same schema) → exercises migrations emit and rollback symmetry.
release-1.0.1-to-1.1.0/ — schema-changing delta (new optional field) → exercises --schema-change gate and forward-only migration.

3. Test inventory per layer

3.1 Unit tests (`vault-cli/tests/unit/`)

Scope: pure functions, no I/O.

test_id_allocation.py: content-hash derivation, 6-hex prefix stability, dedup-seq logic, collision detection.
test_canonical_hash.py: key-order invariance (nested dicts hash identically with reordered keys), Unicode NFC normalization, LF normalization, excluded-fields whitelist correctness.
test_merkle.py: release_hash leaf construction, __policy__ + __canon_version__ inclusion, ordering stability.
test_path_parser.py: lowercase enforcement, enum validation, extract-classification-from-path round-trip.
test_chain_form.py: structured {id, position} accepted, legacy <id>@<pos> accepted on read (not write).
test_provenance_enum.py: closed enum rejection of unknown values; generation_meta required iff not human.
test_content_format.py: plaintext/markdown/URL rules per field; allowlist rendering.
test_yaml_hardening.py: billion-laughs rejected, size cap, depth cap, alias rejection, timeout.
test_exit_codes.py: each exit code returned by the correct failure mode.
test_policy_predicate.py: release-policy.yaml filter is the single source; import-graph check enforced.
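
The properties listed for test_canonical_hash.py (key-order invariance, NFC normalization, LF normalization, excluded fields) can be shown with a small sketch. The JSON-based canonical form, the SHA-256 choice, and the `EXCLUDED_FIELDS` contents are assumptions for illustration; the real canonicalization lives in vault-cli.

```python
import hashlib
import json
import unicodedata

EXCLUDED_FIELDS = {"updated_at"}  # hypothetical; not the real exclusion list


def _normalize(value):
    if isinstance(value, str):
        # Convert CRLF/CR line endings to LF, then NFC-normalize.
        text = value.replace("\r\n", "\n").replace("\r", "\n")
        return unicodedata.normalize("NFC", text)
    if isinstance(value, dict):
        return {
            _normalize(k): _normalize(v)
            for k, v in value.items()
            if k not in EXCLUDED_FIELDS
        }
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    return value


def canonical_hash(question: dict) -> str:
    """Hash a question so that key order and text encoding quirks do not matter."""
    canonical = json.dumps(
        _normalize(question), sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

`sort_keys=True` applies at every nesting level, which is what gives the nested-dict invariance the unit test checks for.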

3.2 Integration tests (`vault-cli/tests/integration/`)

Scope: multi-component, hits filesystem/SQLite but no network.

test_build_equivalence.py: YAML → vault.db produces matching release_hash vs golden.
test_validator_drift.py: each drift fixture raises the expected invariant-failure class.
test_registry_append_only.py: a commit that removes a registry line is rejected by vault check.
test_publish_atomicity.py: simulated mid-publish failure (step 5 kill) leaves .pending-<v>/, vault publish --resume completes cleanly.
test_migrations_round_trip.py: forward-then-rollback on the SQL path restores the pre-migration state (verified by content-hash, not binary diff).
test_migrations_snapshot.py: pre-deploy-snapshot → forward migration → snapshot-restore → vault verify passes.
test_chain_integrity.py: vault rm --hard on chained question refuses; vault move of single member of chain refuses.
test_rename_on_move.py: vault move preserves filename (topic+hash+seq); git follows rename.
test_renumber_recovery.py: simulated post-rebase collision → vault renumber produces valid state.
test_scenario_dedup_lsh.py: MinHash buckets detect near-duplicates within LSH; embedding check catches templated-but-distinct.
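
The append-only property checked by test_registry_append_only.py reduces to a prefix comparison between the base and head versions of the registry. A minimal sketch, assuming the registry is compared line by line; the function name and the sample entries are made up for illustration.

```python
def is_append_only(base_lines: list[str], head_lines: list[str]) -> bool:
    """True when head keeps every base line, in order, and only adds lines at the end."""
    return head_lines[: len(base_lines)] == base_lines


base = ["q-0001 memory-a1b2c3", "q-0002 network-d4e5f6"]  # made-up registry entries
appended = base + ["q-0003 storage-0a1b2c"]
rewritten = ["q-0001 memory-a1b2c3", "q-0003 storage-0a1b2c"]  # line 2 removed
```

A removed, reordered, or edited line changes the prefix, so all three cases fail the same check.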

3.3 CLI contract tests (`vault-cli/tests/cli/`)

Scope: end-to-end Typer CliRunner invocations.

test_vault_new.py: new question creates file at correct path, ID allocated, YAML validates. --count N --batch opens N drafts in $EDITOR. Validation failure injects error comment + stderr.
test_vault_edit.py: edit persists changes, re-validates, exits 1 on invalid save.
test_vault_rm.py: default soft-delete sets status: deprecated; --hard without typed confirmation refuses; typed confirmation hard-deletes.
test_vault_move.py: reclassify via git mv; refuses on dirty tree; refuses chain-breaking move.
test_vault_restore.py: deprecated → published round-trip.
test_vault_build.py: builds vault.db with correct release_hash; --local-json emits corpus.json.
test_vault_check.py: fast-tier <1s, structural-tier <30s on fixture; --json emits LSP-diagnostic shapes.
test_vault_publish.py: composed product equivalent to primitive sequence; .ship-journal.json absent on pre-ship state.
test_vault_ship_journal.py: simulated sub-step failure populates journal; --resume continues; paper-leg failure pages (mock alerting).
test_vault_deploy.py: pre-deploy R2 snapshot synchronous; POP propagation probe blocks on unverified stale.
test_vault_rollback.py: snapshot method restores to verified state; sql method works on content-only release.
test_vault_verify.py: exit 0 on match, exit 1 on any divergence (id-registry, content-hash, release_hash).
test_vault_generate.py: exemplar-pool enforcement (refuses <3 eligible); dry-run emits cost estimate without API call; cap --count ≤25; secrets file mode 0600 enforced; daily ledger refuses over-ceiling.
test_vault_api.py: local shim mirrors Worker endpoint surface; schemas match shared-types codegen.
test_vault_doctor.py: each subcheck runnable independently; --json emits stable schema.
test_exit_code_taxonomy.py: each failure mode returns the correct exit code per §4.6.

3.4 Data-migration tests (one-time, Phase 1 exit gate)

test_corpus_json_split.py: current corpus.json → per-question YAML produces set-equal IDs.
test_content_hash_parity.py: per-question content_hash from YAML matches hash recomputed from original corpus.json fields (post-whitelist + canonicalization).
test_chain_graph_isomorphism.py: chain structure in vault.db matches chains.json exactly.
test_policy_filter_parity.py: {published, validated} predicate produces the same set as current paper's analyze_corpus.py AND the same set as site's filter predicate — converged count (no more 9,199 vs 8,053).

3.5 Shared-types drift tests

test_codegen_drift.py: re-run LinkML codegen in tempdir, diff against committed @staffml/vault-types/, vault-cli/src/vault_cli/models/, releases/<latest>/schema.sql. Any diff → fail PR. CI never auto-fixes.

3.6 Worker contract tests (`staffml-vault-worker/tests/`)

Scope: Worker behavior under miniflare (Cloudflare Workers local dev).

test_endpoint_shapes.py: every endpoint's JSON matches @staffml/vault-types schema.
test_cursor_pagination.py: opaque cursors; stable order across calls; invalid cursor rejected.
test_etag.py: ETag format "<release_id>:<resource>:<content_hash>"; 304 on If-None-Match.
test_cache_release_key.py: new release_id in release_metadata invalidates all Cache API entries.
test_schema_fingerprint.py: cold-start hashes sqlite_master; mismatch → X-Vault-Degraded header + Cache-API-only mode, not 5xx.
test_grace_window.py: during 10-min post-deploy window, worker accepts both current and previous release_id.
test_rate_limit.py: 60 req/min/IP on GETs, 10 req/min on /search.
test_cors.py: production domains allowed, wildcard not.
test_no_mutations.py: no public mutation endpoints; POST/PUT/DELETE to any non-admin path → 405.
test_admin_removed.py: POST /admin/release returns 404.
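
The ETag shape and the 304 behavior covered by test_etag.py can be sketched in a few lines. This is an illustration in Python rather than Worker code; the function names and the example resource string are hypothetical, and splitting on commas assumes the ETag fields themselves contain none.

```python
from typing import Optional


def make_etag(release_id: str, resource: str, content_hash: str) -> str:
    """Build a quoted ETag in the "<release_id>:<resource>:<content_hash>" shape."""
    return f'"{release_id}:{resource}:{content_hash}"'


def status_for(if_none_match: Optional[str], current_etag: str) -> int:
    """Return 304 when the client's validator matches, else 200."""
    if if_none_match is None:
        return 200
    # If-None-Match may carry a comma-separated list or the wildcard "*".
    candidates = [c.strip() for c in if_none_match.split(",")]
    # If-None-Match uses weak comparison, so drop any W/ prefix.
    candidates = [c[2:] if c.startswith("W/") else c for c in candidates]
    return 304 if "*" in candidates or current_etag in candidates else 200
```

Because the release_id is part of the tag, a new release changes every ETag at once, which lines up with the release-keyed cache invalidation tested separately.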

3.7 End-to-end tests (Playwright, Phase 4)

Against staging site pointing at staging D1. Full flows:

e2e_practice.spec.ts: load → filter by track → reveal → navigate chain → AskInterviewer tutor.
e2e_gauntlet.spec.ts: start session → answer N → view post-mortem.
e2e_progress.spec.ts: attempts persist; due count correct.
e2e_landing.spec.ts: question count matches manifest; no 19MB JS chunk in network tab.
e2e_about.spec.ts: "Read the paper" visible above fold; BibTeX rendered; release_hash in footer.
e2e_search.spec.ts: ⌘K opens palette; debounce; results ranked; snippet highlights.
e2e_offline.spec.ts: service worker serves last 200 questions after DevTools offline toggle.
e2e_fallback.spec.ts: NEXT_PUBLIC_VAULT_FALLBACK=static serves corpus.json fallback.
e2e_rollback_drill.spec.ts: live site → flip fallback flag → site keeps serving (rehearsed pre-production).

3.8 Smoke tests (`vault smoke-test`, pre-every-deploy)

50 random question IDs per run: vault smoke-test --env <env> --samples 50.

For each ID:

Worker response GET /questions/:id.
Direct D1 query SELECT * FROM questions WHERE id=?.
Direct vault.db query from releases/<v>/vault.db.

Assert: all three return byte-identical JSON. Any divergence fails the smoke-test; deploy aborts.
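
The three-way assertion can be sketched as a comparison of canonical serializations. Whether `vault smoke-test` compares raw response bytes or re-serialized records is not specified here; this sketch assumes each source has already been decoded into a dict, and the function names and source labels are hypothetical.

```python
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys so equal records give equal bytes."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def divergent_sources(by_source: dict[str, dict]) -> list[str]:
    """Return the names of sources whose record differs from the first source."""
    names = list(by_source)
    reference = canonical_bytes(by_source[names[0]])
    return [n for n in names[1:] if canonical_bytes(by_source[n]) != reference]
```

An empty result means all sources agree; a non-empty one names the sources to report before aborting the deploy.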

Additional checks:

release_id consistency across /manifest, /questions/:id, and release_metadata.
Cache API hit rate > 80% on repeat call (warm).

3.9 Load tests (k6, Phase 4 pre-cutover + pre-scale-milestones)

Scenarios against staging with production-scale corpus:

Steady-state: 100 req/s sustained for 10 min.
Burst: 500 req/s for 30s.
Search-heavy: 20% of traffic is /search with realistic query mix (60% term, 20% phrase, 20% multi-term).
Cold-path injection: 10% of requests hit non-warm POPs (via header manipulation).

Gates (ARCHITECTURE.md §10.6):

p99 warm ≤ 100ms on /search.
p99 cold ≤ 500ms.
D1 row-reads/FTS5-query ≤ 500.
5xx rate = 0%.
Cost projection ≤ $30/mo at 1K DAU scale.
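
For reference, the two latency gates amount to a percentile check over the recorded samples. The sketch below uses the nearest-rank definition of p99; k6 computes its own percentiles and may interpolate differently, so treat this as an illustration of the gate logic rather than a replacement for the k6 thresholds. Function names are hypothetical.

```python
def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample list."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) using integer arithmetic to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def latency_gates_pass(
    warm_ms: list[float],
    cold_ms: list[float],
    warm_limit: float = 100.0,
    cold_limit: float = 500.0,
) -> bool:
    """Both gates must hold: p99 warm <= 100 ms and p99 cold <= 500 ms."""
    return p99(warm_ms) <= warm_limit and p99(cold_ms) <= cold_limit
```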

3.10 Rollback tests

Property test on every publish: for every release N, apply(d1-migration.sql) + apply(d1-rollback.sql) produces state whose release_hash matches pre-migration state. Runs in CI.
Snapshot-restore drill, quarterly: restore staging D1 from an R2 snapshot end-to-end. Timed. Target RTO ≤ 10 min decision-to-restored.
Feature-flag rollback drill, pre-Phase-4: staging site with active service worker; flip NEXT_PUBLIC_VAULT_FALLBACK=static; redeploy; verify SW evicts stale cache; site serves from inlined corpus.
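
The property test in the first item can be sketched with the standard-library `sqlite3` module. The logical `state_hash` below stands in for the real release_hash, and it assumes each table's first column is a unique key so that row order is deterministic. Function names are hypothetical.

```python
import hashlib
import sqlite3


def state_hash(conn: sqlite3.Connection) -> str:
    """Hash logical content (schema + ordered rows), not the database file bytes."""
    digest = hashlib.sha256()
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    for name, sql in tables:
        digest.update(f"{name}|{sql}".encode())
        # ORDER BY 1 sorts on the first column, assumed to be a unique key.
        for row in conn.execute(f'SELECT * FROM "{name}" ORDER BY 1'):
            digest.update(repr(row).encode())
    return digest.hexdigest()


def is_symmetric(conn: sqlite3.Connection, forward_sql: str, rollback_sql: str) -> bool:
    """Apply forward then rollback; True when the logical state is unchanged."""
    before = state_hash(conn)
    conn.executescript(forward_sql)
    conn.executescript(rollback_sql)
    return state_hash(conn) == before
```

Run against an in-memory database seeded from the test corpus, a rollback script that misses even one changed row makes `is_symmetric` return False.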

3.11 Data-plane SLI probes (continuous, post-Phase-3)

Worker crons per ARCHITECTURE.md §10.5 table. Alert on any divergence.

5-min: row-count parity vault.db vs D1.
Hourly: 20-sample content_hash match.
Hourly: FTS5 row count vs base table.
Daily: provenance distribution, staleness, validation-failure rate on main.
On deploy + hourly: /manifest release_id propagation across 8 POPs.

4. CI workflow spec

4.1 `.github/workflows/staffml-validate-vault.yml` (every PR touching `interviews/vault/`, `interviews/vault-cli/`, `interviews/staffml-vault-worker/`, or `interviews/staffml/package.json`)

name: '🎯 StaffML · ✅ Validate (Vault)'
on:
  pull_request:
    paths:
      - 'interviews/vault/**'
      - 'interviews/vault-cli/**'
      - 'interviews/staffml-vault-worker/**'
      - 'interviews/staffml/package.json'

jobs:
  validate:
    runs-on: ubuntu-latest  # upgrade to larger-runner if budget exceeded
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }  # PINNED — hash stability
      - run: pip install -e interviews/vault-cli/[dev]
      - run: vault --version

      # Fast + structural invariants (<60s budget)
      - run: vault check --strict
        timeout-minutes: 2

      # Build + Merkle equivalence (<30s)
      - run: vault build
      - name: Compare release_hash vs corpus-equivalence-hash.txt
        run: |
          EXPECTED=$(cat interviews/vault/corpus-equivalence-hash.txt)
          ACTUAL=$(vault stats --format json | jq -r .release_hash)
          [ "$EXPECTED" = "$ACTUAL" ] || exit 1

      # Shared-types codegen drift check
      - run: vault codegen --check

      # Registry append-only check
      - run: |
          git fetch origin main
          python interviews/vault-cli/scripts/check_registry_append_only.py

      # Unit + integration + CLI contract tests
      - run: pytest interviews/vault-cli/tests/ -x --timeout=60

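      # Rollback symmetry (release tags only). With the pull_request-only
      # trigger above, this condition is never true; the step runs only if
      # the workflow also gains a push trigger for v* tags.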
      - name: Rollback symmetry property test
        if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
        run: pytest interviews/vault-cli/tests/release/ --rollback-symmetry

Status check required for PR merge.

4.2 `.github/workflows/vault-nightly.yml`

name: vault-nightly
on:
  schedule:
    - cron: '0 8 * * *'  # 08:00 UTC

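jobs:
  slow-checks:
    runs-on: ubuntu-latest
    steps:
      # Same setup as the PR workflow so the vault CLI is installed
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e interviews/vault-cli/[dev]
      - run: vault check --tier slow     # link rot, LLM math, Pint units
      - run: vault doctor --check-links  # updates vault/link-rot.yaml
      - run: vault stats --format prometheus > /tmp/metrics.prom
      # PROMETHEUS_PUSHGATEWAY must be provided in the job environment
      - run: curl -X POST "$PROMETHEUS_PUSHGATEWAY" --data-binary @/tmp/metrics.prom

  data-plane-sli-sweep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e interviews/vault-cli/[dev]
      - run: vault smoke-test --env production --samples 100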

4.3 `.github/workflows/vault-worker-deploy.yml` (invoked by `vault ship`)

on:
  workflow_dispatch:
    inputs: { version: {required: true}, env: {required: true} }

jobs:
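  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.env }}  # gates production behind env-protection review
    steps:
      # Same setup as the PR workflow so the vault CLI is installed
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e interviews/vault-cli/[dev]
      - run: vault verify ${{ inputs.version }}
      - run: vault deploy ${{ inputs.version }} --env ${{ inputs.env }}
      - run: vault smoke-test --env ${{ inputs.env }} --samples 50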

5. Phase-entry gates

| Phase | Entry gate | Testing artifacts required |
|---|---|---|
| Phase 0 | nothing | none |
| Phase 1 | Phase 0 deliverables green | EVOLUTION.md committed; vault-cli/README.md; vault-cli skeleton + pytest green |
| Phase 2 | Phase 1 milestone: `vault build` release_hash matches equivalence-hash | Data-migration tests green; drift fixtures cover all invariant classes |
| Phase 3 | License decision resolved (L-10) | Content-hash parity test green; policy-filter-parity test green; rollback-symmetry test green on test corpus |
| Phase 4 | FTS5 load-test gate: p99 warm ≤100ms, p99 cold ≤500ms, row-reads ≤500/query | k6 load test artifacts; worker contract tests green; smoke-test infrastructure deployed to staging |
| Phase 4 cutover | Lighthouse CI green on practice/gauntlet/landing; E2E suite green; rollback drill executed in staging | Lighthouse artifacts; Playwright reports; rollback-drill recording |
| Phase 5 | Phase 4 cutover stable 48h | Chain-badge instrumentation emitting events; SLI dashboards green |
| Phase 6 | Phase 5 shipped | About-page e2e test green |

6. Observability + rollback during Phase 4 rollout

See ARCHITECTURE.md §19.5 + §10.5. Summary:

Transport: 5xx rate alert (>1% over 5 min), p99 latency alert (>500ms sustained), request anomaly.
Data-plane: row-count parity, content-hash sampling, FTS5 parity, schema_fingerprint parity, manifest-release_id-propagation across POPs. Any divergence → red.
Canary: vault ship --canary-percent 10 → 50 → 100; each stage soaks for at least 15 min and until at least 100 sessions have been observed, whichever takes longer.
Rollback: NEXT_PUBLIC_VAULT_FALLBACK=static redeploy. Rehearsed on staging pre-cutover.
Dashboards: Cloudflare Analytics + Grafana (scraped from vault stats --format prometheus).
Alerting: PagerDuty on any red SLI; Slack on yellow.
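
The canary progression can be read as a small state machine: a stage advances only once its soak condition holds. The sketch below interprets the soak rule as "at least 15 minutes and at least 100 observed sessions", which is an assumption about the intended meaning; the function names are hypothetical and this is not how `vault ship` is implemented.

```python
from typing import Optional

CANARY_STAGES = (10, 50, 100)  # percentages from the rollout summary


def soak_complete(
    elapsed_min: float, sessions: int, min_minutes: float = 15.0, min_sessions: int = 100
) -> bool:
    """Both floors must be met before a stage is considered soaked."""
    return elapsed_min >= min_minutes and sessions >= min_sessions


def next_stage(current: int, elapsed_min: float, sessions: int) -> Optional[int]:
    """Return the next canary percentage, or None to hold (or when already at 100)."""
    if current == CANARY_STAGES[-1] or not soak_complete(elapsed_min, sessions):
        return None
    return CANARY_STAGES[CANARY_STAGES.index(current) + 1]
```

Low-traffic periods are the case this guards against: fifteen quiet minutes with a handful of sessions would not be enough evidence to widen the rollout.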

7. Living document

TESTING.md evolves with the CLI. Each new vault subcommand adds a test_vault_<cmd>.py file. Each new invariant in ARCHITECTURE.md §5 adds a drift fixture. PRs that add functionality without tests are CI-blocked by coverage minimum (initially 80%, ratcheted upward).

End of testing plan.
