29 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
f12d303769 chore(interviews): purge stale AI prompts and dev scratch from interviews/
Remove ten files from the public repo that should never have been
tracked. Verified no code references any of them before deleting.

AI-prompt files (private to author tooling, do not belong in the public
repo):

  - interviews/vault-cli/docs/GEMINI_SELF_AUDIT_PROMPT.md
  - interviews/vault/_pipeline/runs/gemini-self-audit/prompts/{cloud,
    edge,global,mobile,tinyml}_audit_prompt.md (5 per-track prompts;
    interviews/vault/.gitignore already excludes /_pipeline/, but these
    five were force-added in f6c41d7689 before the rule was set)

Dev-scratch artifacts (clearly leftover dev iteration; three of the four
filenames say 'final' in different ways):

  - interviews/vault-cli/check_results_absolute_final.json
  - interviews/vault-cli/check_results_after_repair.json
  - interviews/vault-cli/check_results_final.json
  - interviews/vault-cli/check_results_total_final.json

No production code, tests, docs, or CI references any of these paths.
The audit-pipeline scripts that *would* write into _pipeline/ already
respect the existing gitignore rule for that directory tree.
2026-05-05 10:51:53 -04:00
Vijay Janapa Reddi
e465587959 docs(vault-cli): GEMINI_SELF_AUDIT_PROMPT.md — agentic audit via gemini CLI
A self-contained prompt that lets gemini CLI walk the corpus and audit it
directly via its own filesystem tools, without the audit_corpus_batched.py
Python wrapper. Useful when the wrapper hits rate-limit / exit-55 walls
or when the operator wants Gemini to checkpoint to disk as it goes.

The prompt uses an append-only JSONL output at
interviews/vault/_pipeline/runs/gemini-self-audit/01_audit.jsonl with
resume semantics (re-running skips qids already in the file). Encodes
the same five gates as audit_corpus_batched.py (format_compliance,
level_fit, coherence, math_correct, title_quality) plus a stable JSON
shape so downstream tooling can consume it identically.
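
The resume semantics described above can be sketched as follows. This is a minimal illustration, not the prompt's actual logic: the `qid` field name and the helper names are assumptions.

```python
import json
from pathlib import Path


def load_done_qids(path: Path) -> set[str]:
    """Collect the qids already present in an append-only JSONL file."""
    done: set[str] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["qid"])  # "qid" key is an assumed field name
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip a partially written line left by an interrupted run.
                continue
    return done


def append_result(path: Path, row: dict) -> None:
    """Append one audit row; earlier rows are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")


def pending(all_qids: list[str], path: Path) -> list[str]:
    """Return the qids that still need auditing, preserving input order."""
    done = load_done_qids(path)
    return [q for q in all_qids if q not in done]
```

Re-running with the same output path then only visits the qids returned by `pending`.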

Includes invocation guidance: --yolo + --skip-trust, slice by track to
avoid the multi-hour serial walk, resume across sessions.
2026-05-04 10:36:31 -04:00
Vijay Janapa Reddi
ac2c7b39eb docs(vault-cli): PHASE_5_UNRESOLVED.md — post-drain accounting
Reflects the 2026-05-04 follow-up slices: math-skip-level (15 applies)
and math-finish queue drain (66 applies). Cumulative now 2,372 of
2,757 (86.0%); 385 known-deferred ahead of Phase 6. Also corrects the
original doc's '70 already-applied no-ops' line — those were unverified
math candidates the verify guard skipped, not no-ops.
2026-05-04 08:14:16 -04:00
Vijay Janapa Reddi
2dc556e1e5 docs(vault-cli): PHASE_6_HANDOFF.md — resume guide after Phase 5 mass-apply
Self-contained resume guide for the next session:

  - Confirms Phases 0-5 (autonomous) + 8 done
  - Documents 478 unresolved corrections (cross-refs PHASE_5_UNRESOLVED)
  - Step-by-step for Phase 5 cleanup → Phase 6 schema → Phase 7 verify
    → Phase 9 release
  - Concrete CLI commands for each step (vault audit review with
    --filter-gate flags, vault codegen, vault publish)
  - Reference doc map (which doc covers what)
  - Pipeline data layout (where the canonical 01_audit.json lives)
  - Full commit log from this session
  - Merge command to land yaml-audit on dev when ready
  - Paste-ready resume prompt for the next Claude Code session

Total estimated remaining work to ship vault 1.0.0: ~9h, mostly Phase 5
review + Phase 6 schema. Tree is clean; ready to hand off.
2026-05-04 07:14:47 -04:00
Vijay Janapa Reddi
79b4c3361e docs(vault-cli): PHASE_5_UNRESOLVED.md — list of corrections needing human review
After the autonomous Phase 5 mass-apply + math-verify passes,
2,279 of 2,757 corrections (82.6%) were auto-applied. The remaining
478 were deliberately not applied because they fail one of three
safety checks:

  75 math 'no'             — independent Gemini check disputed the fix
  14 math 'unclear'        — Gemini wasn't confident
  13 math + level-block    — fix has level relabel that breaks a chain
 168 relabel-up            — against CORPUS_HARDENING_PLAN.md §10 Q3
 138 chain-block           — would break chains.json monotonicity
  70 already-applied       — no action needed

This doc:
  - Summarizes the skip reasons + counts
  - Points to the disposition logs in _pipeline/runs/
  - Recommends a per-category review workflow
  - Notes which categories are highest priority (math 'no')
  - Notes which are chain-restructuring decisions (out of Phase 5 scope)

Reviewer flow uses `vault audit review` (apply_corrections.py wrapper)
with --filter-gate to target specific buckets.

Phase 5 autonomous portion is COMPLETE. Phase 6 (schema tightening)
remains safe to attempt once the 478 are dispositioned or
accepted as known-deferred.
2026-05-03 19:17:46 -04:00
Vijay Janapa Reddi
9ee3c34303 docs(vault-cli): PHASE_4_HANDOFF — update post-backfill
Append 2026-05-03 update reflecting:
  - Phase 4 backfill complete (2,757 corrections proposed; 0 errors)
  - 6 cloud questions migrated (stray top-level MCQ fields → details)
  - Phase 8 CLI subcommand shipped (vault audit run/review/summarize/merge)

Next session can skip Step 1 (backfill — done) and start at
Step 2 (Phase 5 interactive review).
2026-05-03 18:39:37 -04:00
Vijay Janapa Reddi
87481ab6a3 docs(vault-cli): refresh AUDIT_FINDINGS_2026-05-03 after Phase 4 backfill
After running --propose-fixes backfills on cloud + edge failures, the
canonical merged audit dataset now has:

  - 9,446 questions audited (100%)
  - 0 errors (all retried clean)
  - 2,757 with suggested_corrections (up from 1,767; +990 new fixes
    from cloud + edge backfills)

Per-track:
  cloud:    4,028  /   851 with fixes
  edge:     2,079  /   669
  global:     313  /    90
  mobile:   1,824  /   677
  tinyml:   1,202  /   470

Phase 5 (apply_corrections.py interactive review) can now begin on the
2,757-row subset.

CORPUS_HARDENING_PLAN.md Phase 4 backfill complete.
2026-05-03 18:38:55 -04:00
Vijay Janapa Reddi
87adaeec2f docs(vault-cli): PHASE_4_HANDOFF.md — resume guide for next session
Self-contained instructions for picking up where Phase 4 left off:

  - Sanity checks for the worktree (vault check, pytest, ruff)
  - Phase 4 backfill steps (cloud + edge --propose-fixes, retry global
    errors, re-merge, regenerate AUDIT_FINDINGS)
  - Phase 5 review workflow (apply_corrections.py with --filter-gate
    + --auto-accept-format for low-risk fixes)
  - Phase 6 schema-tightening checklist (LinkML pattern constraints,
    Details extra="forbid", lift gate into validator)
  - Phase 7 title verification
  - Phase 8 vault audit CLI subcommand (cron is already shipped)
  - Phase 9 release pipeline

Includes:
  - Concrete CLI commands for every step
  - Cost estimates per step
  - Tooling reference (which script does what)
  - Open questions from CORPUS_HARDENING_PLAN.md §10 still to decide
  - Full commit log from this session
  - Troubleshooting (rate limits, codespell, scratch files)

Total estimated time to ship vault 1.0.0: ~12h, mostly Phase 5 human
review. Spread over 2-3 working days.

CORPUS_HARDENING_PLAN.md Phase 4 → Phase 5 transition.
2026-05-03 14:31:45 -04:00
Vijay Janapa Reddi
d2621cc9ed feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc
merge_audit_runs.py — merges multiple per-track audit_corpus_batched
output dirs into one canonical run. Per-qid prefer non-error rows,
then rows with suggested_corrections.
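
A rough sketch of that per-qid preference order, assuming each row is a dict with a `qid` key, an optional `error` key, and an optional `suggested_corrections` list (these field names are assumptions, not confirmed by the source):

```python
def row_rank(row: dict) -> tuple[int, int]:
    """Higher is better: non-error rows first, then rows carrying fixes."""
    is_clean = 0 if row.get("error") else 1
    has_fixes = 1 if row.get("suggested_corrections") else 0
    return (is_clean, has_fixes)


def merge_runs(runs: list[list[dict]]) -> dict[str, dict]:
    """Merge several runs into one row per qid, keeping the best-ranked row."""
    merged: dict[str, dict] = {}
    for rows in runs:
        for row in rows:
            qid = row["qid"]
            best = merged.get(qid)
            # On a tie the earlier run's row is kept.
            if best is None or row_rank(row) > row_rank(best):
                merged[qid] = row
    return merged
```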

AUDIT_FINDINGS_2026-05-03.md — first complete corpus audit.

summarize_audit.py — truncate rationale snippets at word boundaries
(was truncating mid-word, tripping codespell on words like 'claimin').
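
One way to truncate at a word boundary, shown as a hedged sketch; the function name, the `limit` parameter, and the trailing ellipsis are illustrative assumptions and not the script's actual code:

```python
def truncate_at_word(text: str, limit: int, ellipsis: str = "...") -> str:
    """Shorten text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the cut lands mid-word, back up to the last space before it.
    if not text[limit].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip() + ellipsis
```

With this approach a snippet that would have ended in 'claimin' ends at the previous whole word instead.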

Phase 4 final stats (9,446 published questions audited):
  format_compliance:   ~960 fail
  level_fit:          ~1,580 fail
  coherence:            ~480 fail
  math_correct:         ~330 fail
  title_quality:        ~250 placeholder + ~25 malformed
  20 error rows in global to retry on next run

1,767 questions have suggested_corrections; ~1,500 more need a
propose-fixes backfill pass (mostly cloud, some edge).

CORPUS_HARDENING_PLAN.md Phase 4 finalization.
2026-05-03 14:26:37 -04:00
Vijay Janapa Reddi
36f2ef5929 docs(vault-cli): CORPUS_HARDENING_PLAN.md — supersedes RELEASE_AUDIT_PLAN.md
End-to-end plan for taking the published-corpus audit from "stratified
sample at ~2,900 calls / 12 days" to "full corpus at ~450 calls / ~3
days". The previous plan over-budgeted by 6× because it assumed
1-call-per-gate-per-question; switching to batched 30-questions-per-call
collapses the cost.
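
The arithmetic behind the batching claim can be checked with a small sketch. The 9,446 question count comes from the corpus survey; the helper names are hypothetical. A single batched pass needs fewer calls than the plan's ~450 figure, and the source does not itemize the difference.

```python
import math

QUESTIONS = 9_446   # published questions, per the corpus survey
BATCH_SIZE = 30     # questions per call, per the plan


def batched_calls(questions: int, batch_size: int) -> int:
    """Minimum number of calls for one pass over the corpus."""
    return math.ceil(questions / batch_size)


def budget_ratio(old_calls: int, new_calls: int) -> float:
    """How many times larger the old call budget was."""
    return old_calls / new_calls


single_pass = batched_calls(QUESTIONS, BATCH_SIZE)  # floor for one pass
ratio = budget_ratio(2_900, 450)                    # old plan vs new plan
```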

Nine phases, 27 testable acceptance criteria. End state: every published
YAML conforms to a strict schema with load-time-enforced format markers
(Pitfall/Rationale/Consequence + Assumptions/Calculations/Conclusion);
math, level-fit, coherence, vendor fabrication, and physical realism are
independently Gemini-verified at corpus scale; new violations are caught
at vault check --strict time and cannot silently land.

Major design choices:
- Audit + corrections in one tool (audit_corpus_batched.py), with a
  --propose-fixes mode whose suggestions are NEVER auto-applied —
  humans review via apply_corrections.py.
- Schema tightening AFTER cleanup, not before (Phase 6 lifts pattern
  constraints into LinkML / Pydantic only once Phase 5 has cleaned the
  corpus, so the new constraints reject nothing real).
- Cron the audit (Phase 8) so findings become a routine artifact.
- AUTHORING.md + vault new scaffold (Phase 2) so new contributors see
  the format conventions before authoring, not after CI catches them.
2026-05-03 07:43:47 -04:00
Vijay Janapa Reddi
963fbfb162 docs(vault-cli): RELEASE_AUDIT_PLAN.md — handoff for fresh-session corpus audit
Captures the release-readiness state of the vault and the plan for
finishing the audit work the 250/day Gemini cap has constrained.

Corpus health survey (9,446 published questions, no Gemini cost):
  - 100% schema-valid (Pydantic)
  - 90.9% format-compliant (Pitfall/Rationale/Consequence + Assumptions/
    Calculations/Conclusion markers)
  - 9.1% fail format compliance (861 questions; mechanical fixes)
  - 134 placeholder titles (all global/* "Global New NNNN")
  - 407 with provenance: None (should be "imported")
  - 95.3% canonical bold-marker napkin_math; 4.7% partial / bullet-only

Template gap noted: vault new scaffolds only scenario + solution stubs;
the Pitfall/Rationale/Consequence and Assumptions/Calculations/Conclusion
templates are encoded ONLY in the generation prompt and the
format-compliance regex. There's no human-readable AUTHORING.md.
The new session is asked to ship one.

The plan: stratified sample of 1,000 questions (33 per track × level
cell) with full Gemini gate suite (math + coherence + level_fit +
bridge) at ~2,900 calls across ~12 days at the 250/day cap. Full-corpus
audit (~27,400 calls / ~110 days) is infeasible; sampling captures any
failure mode at >5-10% rate.
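
The day counts follow from dividing the call budget by the daily cap and rounding up. A minimal sketch (the function name is hypothetical):

```python
import math

DAILY_CAP = 250  # Gemini calls per day, as stated in the plan


def days_at_cap(calls: int, cap: int = DAILY_CAP) -> int:
    """Whole days needed to spend `calls` at `cap` calls per day."""
    return math.ceil(calls / cap)


sample_days = days_at_cap(2_900)        # stratified sample
full_corpus_days = days_at_cap(27_400)  # full-corpus audit
```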

Includes:
  - Concrete numbers from the corpus survey (failure counts by category)
  - Day-by-day execution plan with resume instructions
  - Daily cost-ledger format
  - Stopping rules
  - Post-audit cleanup → paper.tech update path
  - Mechanical (no-Gemini) cleanups the new session can do in parallel
    with the daily audit cycle (provenance fix, format markers, AUTHORING.md)

CHAIN_ROADMAP.md Progress Log entry points the resumable cursor at
this plan.
2026-05-02 11:29:57 -04:00
Vijay Janapa Reddi
2b3cf5e1da chore(vault): consolidate AI pipeline artifacts under _pipeline/
Establishes one ignored subdirectory for ALL intermediate outputs of
LLM-driven tooling (chain proposals, gap detection, draft scorecards,
audit traces). Single gitignore rule: /_pipeline/.

Convention is documented in interviews/vault/README.md under "Pipeline
artifacts" — it's a real project layout convention, not AI-specific
config.

Path migration:
  interviews/vault/chains.proposed*.json
                  → _pipeline/chains.proposed*.json
  interviews/vault/gaps.proposed*.json
                  → _pipeline/gaps.proposed*.json
  interviews/vault/draft-validation-scorecard.json
                  → _pipeline/draft-validation-scorecard.json
  interviews/vault/audit-runs/
                  → _pipeline/runs/

8 scripts updated to define a PIPELINE_DIR constant and route default
outputs through it: build_chains_with_gemini.py,
apply_proposed_chains.py, merge_chain_passes.py, validate_drafts.py,
audit_chains_with_gemini.py, generate_question_for_gap.py,
summarize_proposed_chains.py, promote_drafts.py.
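
A sketch of what routing default outputs through one constant might look like. `VAULT_DIR`, `default_run_dir`, and the timestamp format are assumptions for illustration; only the `PIPELINE_DIR` name and the `_pipeline/runs/<ts>/` layout come from this commit.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

VAULT_DIR = Path("interviews/vault")      # assumed repo-relative root
PIPELINE_DIR = VAULT_DIR / "_pipeline"    # single ignored output tree


def default_run_dir(now: Optional[datetime] = None) -> Path:
    """Default output directory for one audit run: _pipeline/runs/<ts>/."""
    now = now or datetime.now(timezone.utc)
    return PIPELINE_DIR / "runs" / now.strftime("%Y%m%dT%H%M%SZ")


def default_scorecard_path() -> Path:
    """Default location of the draft validation scorecard."""
    return PIPELINE_DIR / "draft-validation-scorecard.json"
```

Keeping every default under one constant means a single ignore rule covers all of them.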

Forward-looking docs (README.md chain-pipeline section + CHAIN_ROADMAP.md
resume instructions + state snapshot) updated to reference the new
paths. Historical Progress Log entries left as-is — they accurately
describe what was committed at the time.

Drive-by .gitignore fixes (both used full repo-relative paths under
package-local .gitignore files, which never matched):
  interviews/vault-cli/.gitignore: scripts/.calibration_cache/
  interviews/vault/.gitignore:     /embeddings.npz

Validation:
  - vault check --strict: 10,705 loaded, 0 invariant failures
  - pytest interviews/vault-cli/tests/: 74/74
  - audit --dry-run: paths resolve correctly to _pipeline/runs/<ts>/

No durable corpus content moves. chains.json (live registry),
id-registry.yaml, questions/, etc. all stay where they were.
2026-05-02 09:04:55 -04:00
Vijay Janapa Reddi
270b1a5bd2 fix(vault): drop 55 Δ=0 chains + remove Δ=0 from lenient mode
Action on the strongest finding from the 2026-05-01 independent audit:
54 of 55 Δ=0 chains had no shared scenario (the "two questions
sharing a scenario thread" constraint the lenient prompt was supposed
to enforce). Two independent audit fields agreed (verdict=bad and
shared_scenario=no), so this isn't a tuning question — the design
choice was wrong.

Why remove Δ=0 entirely rather than tighten the prompt:

  - The chain definition is "pedagogical progression through Bloom
    levels"; same-level edges contradict the definition.
  - The "shared scenario / different angle" carve-out is unenforceable
    by an LLM at corpus scale (audit confirmed).
  - Same-scenario same-level pairs are more honestly modeled as
    siblings of a chain anchor, not as chain members.

Changes:
  - chains.json: 879 → 824. Dropped: 55 chains (all tier=secondary,
    since Δ=0 was only ever produced by the lenient sweep).
    Per-track: edge -19, tinyml -12, mobile -10, cloud -7, global -7.
  - build_chains_with_gemini.py:
      MODE_CONFIG["lenient"]["allowed_deltas"]: {0,1,2,3} → {1,2,3}
      LENIENT_PROMPT_TEMPLATE: Δ=0 paragraph rewritten to explicitly
        REJECT same-level pairs (with rationale citing the audit).
      docstring + --mode help text updated.
  - tests/test_chain_validation.py:
      test_lenient_accepts_same_level_pair → test_lenient_rejects_same_level_pair
      header docstring updated to reflect the new rule.
  - vault-manifest.json: chainCount 879 → 824, releaseHash rolls to
    479811040b7a… (real content delta, not a timestamp churn).
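
The effect of the `allowed_deltas` change can be illustrated with a small validator. This is a sketch under assumptions: levels are modelled as integers, and `invalid_edges` is a hypothetical helper, not a function from build_chains_with_gemini.py.

```python
MODE_CONFIG = {
    "lenient": {"allowed_deltas": {1, 2, 3}},  # was {0, 1, 2, 3} before this commit
}


def invalid_edges(levels: list[int], mode: str = "lenient") -> list[tuple[int, int]]:
    """Return (position, delta) for each consecutive pair whose delta is not allowed."""
    allowed = MODE_CONFIG[mode]["allowed_deltas"]
    bad = []
    for i in range(len(levels) - 1):
        delta = levels[i + 1] - levels[i]
        if delta not in allowed:
            bad.append((i, delta))
    return bad
```

A chain with a same-level pair, such as levels `[2, 2, 3]`, now reports the first edge as invalid.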

Validation:
  - vault check --strict: 10,705 loaded, 0 failures
  - vault build --local-json: chainCount=824, releaseHash=479811040b…
  - pytest: 74/74
  - playwright chain-and-vault-smoke: 19/19 (fixtures cloud-0001 +
    cloud-0231 are still in their chains post-drop)

Audit findings #2 (gap detection ~50% noise) and #3 (4 pilot drafts
disposition) remain open — see CHAIN_ROADMAP.md Progress Log.
2026-05-02 08:51:49 -04:00
Vijay Janapa Reddi
b68f6dbf83 audit(vault): independent Gemini audit — 18 calls, 3 critical findings
Ran audit_chains_with_gemini.py end-to-end. 18 Gemini-3.1-pro-preview
calls (well under the 250/day cap) sized to 80-336K char prompts (the
attention sweet spot at ~80-100K input tokens). Per-call traces under
interviews/vault/audit-runs/20260501T213817Z/, rollup at
interviews/vault/audit-runs/AUDIT_REPORT.md.

Three critical findings the pipeline's own gates missed:

  1. Δ=0 chains are ~98% bad (54/55 judged "bad", 54/55 judged
     "shared_scenario_for_d0_pair: no"). The lenient prompt's
     constraint that Δ=0 only fire for shared-scenario pairs didn't
     bind in practice. 6% of chains.json is affected.

  2. Gap detection is ~50% noise. 21 of 40 sampled gaps judged
     "hallucinated" — anchors don't share a scenario thread. Phase 3
     generation should pre-filter gaps before issuing the call.

  3. Pilot draft pass rate was inflated by validate_drafts.py's LLM
     judges:
       mobile-2147  accept
       edge-2536    edit (scenario truncation)
       edge-2537    REJECT (cognitive load too low for L3)
       mobile-2146  REJECT (physically absurd 0.5s/4W NPU wake-up)

Calibration findings:
  - Primary chains (n=100): 64% good, 22% weak, 14% bad
  - Secondary chains (n=100): 61% good, 33% weak, 6% bad
  - The primary-vs-secondary gap at "good" is small (64% vs 61%); the
    actual quality cliff in secondary is concentrated in the Δ=0 subset.

No autonomous fixes filed — per agreement, audit produces findings
only. CHAIN_ROADMAP.md Progress Log spells out the three concrete
decisions for next session (drop / demote / rebuild Δ=0; pre-filter
gaps; disposition the 4 drafts per AUDIT_REPORT.md).

Total Gemini calls this session: 55 (Phase 1.4 + Phase 3 pilot + audit).
2026-05-01 18:04:36 -04:00
Vijay Janapa Reddi
bc553017b4 docs(vault): roadmap status + Phase 3 authoring conventions
D-cleanups folded into one commit:

  - CHAIN_ROADMAP.md status header reflects current state (Phase 1+2
    complete, Phase 3 pilot landed, Phase 4 mostly shipped).
  - Phase 4.1 / 4.6 / 4.7 / 4.9 entries marked complete with commit
    refs.
  - ARCHITECTURE.md gains a §3.6.1 documenting the two YAML-body
    conventions introduced when LLM-authored questions started
    landing in Phase 3:
      - _authoring private metadata block on drafts (stripped at
        promotion)
      - gap-bridge:<from>-<to> tag added at promotion for traceability
    Neither is schema-enforced (Pydantic accepts extra); both are
    stable across the pipeline.

No code changes.
2026-05-01 17:33:36 -04:00
Vijay Janapa Reddi
de46921cfe docs(vault-cli): PHASE_3_REVIEW_GUIDE.md — human review handoff
Walkthrough for reviewing LLM-authored question drafts produced by
generate_question_for_gap.py + validate_drafts.py. Covers:

  - what each of the 5 gates catches and (critically) misses
  - what to read in what order, with watchpoints for the failure modes
    that LLM gates routinely let through (vendor-name fabrication,
    arithmetic drift, level-stamping mismatches)
  - decision tree: promote (publish vs draft), edit + retry, reject
  - exact promote_drafts.py invocations for each path
  - rough scorecard summary for the 4 pilot drafts shipped in
    a750ab7bc, ready for the user's review pass

Designed for ~10-15 min of reading per pilot batch.
2026-05-01 17:24:07 -04:00
Vijay Janapa Reddi
085bf15861 docs(vault-cli): catch up to --legacy-json → --local-json rename
dev renamed the vault-cli flag in 2b381bb949 (the flag is the staffml
frontend's local-dev fallback for reading corpus.json from disk, not a
deprecated path; "local-json" reads correctly in scripts and docs).
Merge of origin/dev (5c5af75ed) brought the new name in but the
roadmap + README still referenced the old one.

  - README.md: 1 replacement in the chain-pipeline runbook footer
  - CHAIN_ROADMAP.md: 8 replacements across resume instructions,
    phase runbooks, and progress-log validator lines

Historical text inside log entries is otherwise unchanged — those
record what was true at commit time. Forward-looking instructions
now use the current flag name.
2026-05-01 17:13:11 -04:00
Vijay Janapa Reddi
bf70e7686f feat(vault): Phase 3 pilot — 5 gaps generated, 4 promoted as drafts
Pilot run of the Phase 3 authoring tooling on a 5-gap subset (sized
down from the roadmap's 30 to keep wall-time + Gemini-call budget
reasonable for an unsupervised run).

Pilot scope:
  Selected 5 high-value gaps from gaps.proposed.lenient.json — buckets
  with ≥4 published questions, biased toward low-density tracks. All 5
  picks landed in edge/mobile.

Phase 3.c — generate (5/5 written):
  edge-2535  edge/latency-decomposition L?→L3
  edge-2536  edge/pruning-sparsity L?→L4
  edge-2537  edge/tco-cost-modeling L?→L3
  mobile-2146  mobile/duty-cycling L?→L3
  mobile-2147  mobile/model-format-conversion L?→L2

Phase 3.b validation — 4/5 pass (80% — above roadmap's 60-75% target):
  edge-2535: FAIL on originality (cos=0.933 vs edge-1883, threshold 0.92)
  edge-2536: pass on all 4 gates
  edge-2537: pass on all 4 gates
  mobile-2146: pass on all 4 gates
  mobile-2147: pass on all 4 gates

The originality gate correctly caught a draft that was too similar
to one of its bridge anchors — exactly the failure mode it was
designed for. Gates were run on schema (Pydantic), originality
(BAAI/bge-small-en-v1.5 cosine vs in-bucket neighbours, threshold
0.92), level_fit (Gemini-judge against same-level exemplars),
coherence (Gemini-judge), and bridge (Gemini-judge against the gap
anchors).

Phase 3.d — promotion (4 passing drafts):
  - .yaml.draft → .yaml rename
  - _authoring stripped; replaced with proper schema fields:
      provenance: llm-draft
      status: draft  (NOT published — gating on human review)
      authors: [gemini-3.1-pro-preview]
      human_reviewed: { status: not-reviewed }
      tags: + gap-bridge:<from>-<to>
  - id-registry.yaml appended (append-only ledger preserved)
  - edge-2535.yaml.draft kept in place for the human reviewer's
    disposition (rewrite + retry vs delete)

Validation post-promotion:
  - vault check --strict: 10,705 loaded (was 10,701; +4 ✓), 0 failures
  - vault build --legacy-json: released set unchanged
    (status=draft excluded by release-policy.yaml's published filter)
    — releaseHash and chainCount intentionally stable until human
    review flips status

Phase 3.e (chain rebuild) deferred: drafts must clear human review
and flip to status: published before they're eligible for chain
membership. Runbook in CHAIN_ROADMAP.md Progress Log.

Cost: 5 generation + 15 judge = 20 Gemini calls.
2026-05-01 13:38:18 -04:00
Vijay Janapa Reddi
604869b986 feat(vault-cli): Phase 3.a + 3.b — gap-driven authoring tooling
Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.

generate_question_for_gap.py (3.a):
  - Reads a gap entry, loads between-questions + same-bucket exemplars,
    prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
    and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
    drafts out of vault check / vault build until promotion.
  - ID allocator scans corpus + existing drafts so a batch run gets
    distinct fresh IDs without touching id-registry.yaml.
  - Modes: --gap-index, --gaps-from + --limit, --dry-run.
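
One plausible shape for such an allocator, as a sketch only. Allocating the next numbers after the highest existing ID in the track is an assumption; the source says only that the allocator scans the corpus and existing drafts.

```python
import re

ID_RE = re.compile(r"^(?P<track>[a-z]+)-(?P<num>\d+)$")


def allocate_ids(track: str, existing_ids: list[str], count: int) -> list[str]:
    """Return `count` fresh IDs for `track`, above every ID already seen.

    `existing_ids` should include published questions and pending drafts.
    """
    highest = 0
    for qid in existing_ids:
        m = ID_RE.match(qid)
        if m and m.group("track") == track:
            highest = max(highest, int(m.group("num")))
    return [f"{track}-{n:04d}" for n in range(highest + 1, highest + 1 + count)]
```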

validate_drafts.py (3.b):
  - Five gates per draft: schema (Pydantic), originality (cosine vs
    in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus
    embeddings.npz so values are comparable; cutoff 0.92), level_fit
    (Gemini-judge against same-level exemplars), coherence
    (Gemini-judge: scenario/question/solution consistency), and bridge
    (Gemini-judge: chain-fit between the gap's two anchors).
  - Final verdict pass iff every non-skipped gate passes.
  - Skips: --no-originality, --no-llm-judge.
  - Output: interviews/vault/draft-validation-scorecard.json.
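
The originality cutoff and the final-verdict rule can be sketched in plain Python. The function names are hypothetical, vectors are plain lists rather than the real embeddings, and treating a similarity exactly at the cutoff as a failure is an assumption.

```python
import math
from typing import Optional

ORIGINALITY_CUTOFF = 0.92


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def originality_passes(draft: list[float], neighbours: list[list[float]],
                       cutoff: float = ORIGINALITY_CUTOFF) -> bool:
    """Pass when the closest in-bucket neighbour stays below the cutoff."""
    top = max((cosine(draft, v) for v in neighbours), default=0.0)
    return top < cutoff


def final_verdict(gates: dict[str, Optional[bool]]) -> bool:
    """None marks a skipped gate; pass iff every non-skipped gate passed."""
    return all(result for result in gates.values() if result is not None)
```

Under this rule a draft with every LLM gate skipped still passes if schema and originality pass.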

Smoke checks:
  - 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
    cloud-4579. Synthetic Gemini response Pydantic-validates clean.
  - 3.b on a synthetic /tmp draft: schema + originality pass (top
    neighbour cosine 0.73 vs 0.92 threshold).

Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.

CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.
2026-05-01 11:31:06 -04:00
Vijay Janapa Reddi
bff166bb9b docs(vault-cli): roadmap log — 4.2 audit (no-op) + 4.8 ship + status
- 4.2: audited multi-chain memberships; 0 qids in >1 chain because
  the lenient sweep was scoped to uncovered buckets (no overlap with
  primary). Deferred the focused playwright test until Phase 3
  authoring makes the case live.
- 4.8: marked complete; cross-ref to f086b6f42.
- Header timestamp + status snapshot updated.
2026-04-30 20:27:24 -04:00
Vijay Janapa Reddi
9680e8e9fd feat(vault+staffml): Phase 2 — tier surfacing, schema → TS → UI
Carries the primary/secondary chain tier (from Phase 1) through the
build pipeline into the practice + explore surfaces, so primary chains
are the unmarked default and secondary chains are an opt-in alternative
path the user can deep-link into via ?chain=<id>.

Backend (2.1):
  - legacy_export.py emits chain_tiers per question alongside chain_ids
    and chain_positions; missing chain-tier defaults to "primary".
  - vault build re-run: 2953 chained questions, all carry chain_tiers
    (releaseHash unchanged — new field is additive, doesn't perturb the
    manifest hash inputs).
  - Existing legacy_export tests were stale (asserted on the v1.0 YAML
    chains: field path; v1.1 made chains.json the sidecar source).
    Rewrote them to write chains.json fixtures into tmp_path and added
    chain_tiers assertions, plus a focused
    test_chain_tiers_emitted_per_membership case.
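
A sketch of the per-question export with the "primary" default. The shape of each chain record (`id`, `questions`, optional `tier`) and the dict-keyed-by-chain-id layout are assumptions, and `export_chain_fields` is a hypothetical helper rather than the code in legacy_export.py.

```python
def export_chain_fields(qid: str, chains: list[dict]) -> dict:
    """Build chain_ids, chain_positions, and chain_tiers for one question."""
    ids: list[str] = []
    positions: dict[str, int] = {}
    tiers: dict[str, str] = {}
    for chain in chains:
        if qid in chain["questions"]:
            cid = chain["id"]
            ids.append(cid)
            positions[cid] = chain["questions"].index(qid)
            # A chain with no tier recorded is treated as primary.
            tiers[cid] = chain.get("tier", "primary")
    return {"chain_ids": ids, "chain_positions": positions, "chain_tiers": tiers}
```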

TypeScript (2.2):
  - Question.chain_tiers? (Record<string, "primary"|"secondary">)
  - ChainTier export, ChainInfo.tier required.
  - getChainForQuestion / getAllChainsForQuestion populate tier;
    getAllChains... sorts primary first.
  - New getPrimaryChainForQuestion(qid) helper for default surfaces.

UI (2.3):
  - practice page reads ?chain=<id> URL param; defaults to
    getPrimaryChainForQuestion when unset.
  - ChainBadge gains an inline "alt path" pill when tier=secondary
    (always visible — no click needed).
  - ChainStrip mirrors that pill in the progress row for users who
    expand the strip.
  - Explore page prefers the first non-secondary chain when picking
    activeChainId for the related-questions panel.
  - Deferred to a follow-up commit (intentional, scoped via Progress Log):
    explore-page "Primary only / All" filter; daily/mock routing.

Tests (2.4):
  - test7_tier_aware_chain_routing in chain-and-vault-smoke.mjs:
    secondary reachable via ?chain=, alt-path badge visible on
    secondary, primary regression, alt-path badge ABSENT on primary.
  - Full smoke suite: 17/17 pass (was 13/13).

Validation:
  - vault check --strict: 10,701 loaded, 0 failures
  - vault build --legacy-json: 9438 published, chainCount=879
  - pytest interviews/vault-cli/tests: 74/74
  - npx tsc --noEmit: 0 errors
  - playwright chain-and-vault-smoke: 17/17

Phase 2 complete. Next: Phase 3 (gap-driven authoring; 407-gap backlog).
2026-04-30 20:22:54 -04:00
Vijay Janapa Reddi
83fe0f7193 feat(vault): Phase 1 — second-pass chain coverage build (373 → 879)
Diagnoses uncovered (track, topic) buckets and runs a relaxed Gemini
sweep targeting them. New chains tier="secondary"; pre-existing chains
backfilled tier="primary".

Tools (Phases 1.1, 1.2/1.3, 1.5):
  - diagnose_chain_coverage.py: surface buckets with no chains
    (committed earlier on yaml-audit)
  - build_chains_with_gemini.py: --mode lenient adds Δ ∈ {0,1,2,3}
    (committed earlier on yaml-audit)
  - merge_chain_passes.py: merges primary + secondary, enforces the
    multi-membership cap (max 2 chains/qid; non-L1/L2 capped at 1)
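
One reading of the multi-membership cap, sketched below: questions at L1 or L2 may sit in up to two chains, all others in at most one. That interpretation, the integer levels, and the helper names are assumptions.

```python
from collections import Counter


def membership_cap(level: int) -> int:
    """Maximum chains a question may belong to, by level."""
    return 2 if level in (1, 2) else 1


def cap_violations(chains: list[list[str]], levels: dict[str, int]) -> list[str]:
    """Return the qids that appear in more chains than their cap allows."""
    counts = Counter(qid for chain in chains for qid in chain)
    return sorted(
        qid for qid, n in counts.items() if n > membership_cap(levels[qid])
    )
```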

Sweep (Phase 1.4):
  - 17 Gemini-3.1-pro-preview calls, ~22 min wall time, 211 buckets
  - 506 chains accepted (above the 200-400 estimate), 269 new gaps
  - validator caught a few cross-bucket and Δ=4 hallucinations inline
  - Δ distribution: Δ=1 69.1%, Δ=2 21.1%, Δ=3 4.6%, Δ=0 5.2%
    (10.9% of chains contain at least one Δ=0 — within target band)
  - random spot-check of 5 Δ=0 chains: all share scenario threads
    (DMA, CMSIS-NN, on-device routing, PB-scale pipelines)

Coverage gains (chains/topic before → after):
  - cloud   2.95 → 4.37   (242 + 116 secondary)
  - edge    0.64 → 2.59   ( 49 + 148 secondary)
  - mobile  0.74 → 2.56   ( 46 + 113 secondary)
  - tinyml  0.80 → 2.64   ( 36 +  83 secondary)
  - global  0.00 → 0.96   (  0 +  46 secondary)
  Buckets with ≥1 chain: 102 / 313 (33%) → 285 / 313 (91%).

Validation:
  - apply_proposed_chains.py --dry-run: validation clean (879 chains)
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: chainCount 373 → 879, release_hash
    rolled to 04ee8a23…
  - playwright chain-and-vault-smoke.mjs: 13/13 pass

Phase 1 complete. Next: Phase 2 (tier surfacing in staffml UI).
2026-04-30 20:12:27 -04:00
Vijay Janapa Reddi
af5f25f543 feat(vault-cli): diagnose_chain_coverage.py — surface buckets needing chains
Loads the published corpus (via vault_cli.policy — single source of truth)
and chains.json, buckets by (track, topic), and emits chain-coverage.json
with two cuts:
  - uncovered_buckets: ≥3 questions, 0 chains
  - under_covered_buckets: ≥6 questions, ≤1 chain
Plus per-track summary + top-10 uncovered for quick read.
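
The two cuts can be sketched as below. The input shapes (question tuples, chains as lists of qids) and the rule that a chain counts once for each bucket it touches are assumptions; the thresholds are the ones stated above.

```python
from collections import Counter

Bucket = tuple[str, str]  # (track, topic)


def classify_buckets(
    questions: list[tuple[str, str, str]],  # (qid, track, topic)
    chains: list[list[str]],                # each chain is a list of qids
) -> tuple[list[Bucket], list[Bucket]]:
    """Return (uncovered_buckets, under_covered_buckets)."""
    bucket_of = {qid: (track, topic) for qid, track, topic in questions}
    q_count = Counter(bucket_of.values())
    chain_count: Counter = Counter()
    for chain in chains:
        # Count a chain once per bucket it touches.
        for bucket in {bucket_of[q] for q in chain if q in bucket_of}:
            chain_count[bucket] += 1
    uncovered = sorted(b for b, n in q_count.items() if n >= 3 and chain_count[b] == 0)
    under = sorted(b for b, n in q_count.items() if n >= 6 and chain_count[b] <= 1)
    return uncovered, under
```

Note that, taken literally, a bucket with six or more questions and no chains satisfies both definitions.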

Output is gitignored — regeneratable, fed to Phase 1.4's --buckets-from.

Phase 1.1 of CHAIN_ROADMAP.md. See progress log for the run results
(211 uncovered buckets, edge/mobile/tinyml chain density 0.6-0.8 vs
cloud's 2.95, biggest miss is cloud:roofline-analysis at 144q/0 chains).
2026-04-30 18:15:59 -04:00
Vijay Janapa Reddi
671b37b37b docs(vault-cli): CHAIN_ROADMAP.md — resumable plan for chain coverage workstream
Canonical document for the multi-phase chain growth plan. Future Claude
sessions read this first to resume exactly where the previous session
left off.

Structure:
  - Resume Here: how to verify state + pick up the next step
  - Current state snapshot: validators, counts, branch tip
  - Phase 1: second-pass coverage build (373 -> ~700 chains)
  - Phase 2: tier surfacing in schema + UI
  - Phase 3: gap-driven authoring (using gaps.proposed.json)
  - Phase 4: misc parallel items (CI gates, multi-chain UI, etc.)
  - Recommended execution order over 4 weeks
  - Progress Log: append-only notes after each step

Initial Progress Log entry captures session state through commit
1ac7d4c56 (Gemini chain rebuild applied, 373 chains live).
2026-04-30 17:38:13 -04:00
Vijay Janapa Reddi
c824ac6ed1 refactor(staffml): retire prod static-fallback; opt-in dev-only (#1598)
The bundled corpus.json was serving as a prod safety net behind the
Cloudflare Worker. Post-cutover the Worker has been the real data
source, and the static path was silently degrading rather than helping
(corpus.json is a generated artifact whose prose `details` are blank
in corpus-summary.json). This change:

- Stops emitting corpus.json in the publish-live workflow
- Removes the Worker-error fallback in getQuestionFullDetail — errors
  now propagate to useFullQuestion and the UI shows a "details
  unavailable" banner instead of silently filling blanks
- Drops the localhost auto-trigger in shouldUseStaticDetails — the
  static path now requires explicit NEXT_PUBLIC_VAULT_FALLBACK=static
- Switches taxonomy.ts to corpus-summary.json (was corpus.json)
- Rewrites the publish-live smoke tests against corpus-summary.json
- Collapses validate-vault.py to sparse-only (per-question deep
  validation lives in `vault check --strict`)

Static-fallback remains as an OPT-IN local-dev affordance: set
NEXT_PUBLIC_VAULT_FALLBACK=static and run `vault build --legacy-json`
to materialize corpus.json. The Function-constructor dynamic import
keeps Turbopack from requiring corpus.json at build time.

useFullQuestion hook signature changed from `Question | undefined` to
`{ question, status }`. Callers updated: practice and plans pages
(both render an amber "details unavailable" banner when status
is 'error').

Deleted dead cutover scaffolding: corpus-source.ts (router with no UI
consumers), corpus-vault.ts (worker-only mirror, never wired up),
useVaultQuestion.ts (unused migration hook), vault-fallback.ts (only
consumer was corpus-source.ts).

Deleted stale docs: staffml/scripts/DEPRECATED.md, vault-cli/docs/
CUTOVER_QA.md, three vault/docs/RESUME_PLAN_*.md.

Verified locally: tsc clean, vitest 37/37, next build produces all
15 static routes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:47:03 -04:00
Vijay Janapa Reddi
5131cb28fc docs: R11 stability cleanup + v2.6 - 11 rounds, convergence declared
R11 (David, fresh-eyes stability check): 0 Critical + 0 High + 1 Medium
(doc cleanup from R10-F-2 closure itself).

R11-M-1 (MEDIUM): CUTOVER_QA.md + vault-cli/README.md still referenced
--canary-percent flag after R10-F-2 removed it from code + ARCHITECTURE.md.
Operator following CUTOVER_QA.md step 1 of cutover day would hit
'Error: no such option --canary-percent' in the one document whose
entire purpose is cutover correctness.

Fix: CUTOVER_QA.md §1 replaces canary-staged rollout with all-or-nothing
ship language + Phase-7-deferred note pointing at §4.3. README.md:57
drops [--canary-percent N] from the ship example.

STABILITY DECLARED after R11. Three consecutive rounds (R7, R8, R11) with
zero new Criticals. R11 explicit: 'convergence confirmed.'

Finding-density trajectory across 11 rounds (new Criticals per round):
  R1: 3, R2: 1, R3: 2, R4: 3, R5: 3, R6: skipped,
  R7: 0, R8: 0, R9: 1* (regression-detect, not new), R10: 0, R11: 0

Total findings closed across all rounds: ~120.
No further rounds scheduled.

ARCHITECTURE.md header bumped v2.5 → v2.6.
REVIEWS.md adds 'Rounds 7–11' section with per-round finding counts,
notable findings, meta-observation on R9 (tooling/persistence issue
Gemini caught that individual-file reviewers couldn't), and the
convergence signal.
2026-04-16 16:42:39 -04:00
Vijay Janapa Reddi
f25f9e8184 feat(vault): B.1-B.7 + B.13 + B.15 + B.17 - finish bucket B
Worker hardening (interviews/staffml-vault-worker/src/index.ts rewritten):
- B.1 Cloudflare Cache API wired via caches.default; cache key is
  /__vault__/<release_id>/<path> so each release is a disjoint namespace.
  Deploy changes release_id → all old entries miss atomically. Degraded
  responses are NEVER cached (would poison the namespace).
- B.3 Keyset pagination: cursor is {after_id, filter_hash}. Server
  computes filter_hash per-request and rejects cross-filter cursor reuse
  with 400. Pagination cost drops from O(offset + N) to O(N) per page.
- B.4 Rate limiting via RATE_LIMIT_KV (src/rate_limit.ts): token bucket
  per (IP, class) windowed at 60s. 'default' 60 rpm, 'search' 10 rpm.
  Returns 429 with Retry-After header. Open-allows if KV not bound so
  the local vault-api shim still works.
- /search uses FTS5 MATCH when questions_fts exists; fallback to LIKE
  for pre-FTS5 D1 instances. Escapes FTS5 special chars to prevent
  MATCH injection.
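
The keyset cursor in B.3 can be illustrated with a minimal Python sketch.
This is not the Worker code (which is TypeScript in src/index.ts); the
function names, the base64/JSON cursor encoding, and the 16-character
digest are assumptions made for illustration. Only the {after_id,
filter_hash} shape and the reject-on-mismatch rule come from the notes
above.

```python
import base64
import hashlib
import json
from typing import Optional


def filter_hash(filters: dict) -> str:
    """Stable digest of the active filters (sorted keys, so order is irrelevant)."""
    canonical = json.dumps(filters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def encode_cursor(after_id: str, filters: dict) -> str:
    """Pack {after_id, filter_hash} into an opaque URL-safe token (hypothetical encoding)."""
    payload = {"after_id": after_id, "filter_hash": filter_hash(filters)}
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str, filters: dict) -> str:
    """Return after_id, or raise ValueError (an HTTP 400 in a server) on filter mismatch."""
    payload = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    if payload["filter_hash"] != filter_hash(filters):
        raise ValueError("cursor was issued for a different filter set")
    return payload["after_id"]


def page(rows: list, filters: dict, limit: int, cursor: Optional[str] = None):
    """Keyset page over rows ordered by id.

    The SQL equivalent is `WHERE id > :after_id ORDER BY id LIMIT :limit`,
    which is why cost no longer grows with the offset.
    """
    after_id = decode_cursor(cursor, filters) if cursor else ""
    matching = [
        r for r in sorted(rows, key=lambda r: r["id"])
        if r["id"] > after_id and all(r.get(k) == v for k, v in filters.items())
    ]
    chunk = matching[:limit]
    more = len(matching) > limit
    return chunk, (encode_cursor(chunk[-1]["id"], filters) if more else None)
```

Recomputing the hash on every request means a client cannot carry a
cursor from one filter set into another and silently skip rows.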

vault-api.ts circuit breaker (B.2, Soumith R3-F-2 fix):
- Proper closed → open → half-open state machine. Half-open admits
  exactly one probe; failure → re-open immediately, success → close.
- AbortSignal.timeout(10_000) per-attempt; AbortSignal.any() combines
  with caller's signal so React unmounts don't count as failures.
- Retry only on retryable statuses (408/425/429/5xx/network), not on
  4xx user errors or caller-aborted fetches.
- Module-level _singleton so multiple makeClientFromEnv() share breaker
  state. __resetSingleton() exposed for tests.
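
The state machine described above can be sketched in Python to show the
transitions. The real breaker lives in vault-api.ts; the class name, the
failure threshold, the cooldown, and the injectable clock below are all
assumptions for illustration, and the sketch is synchronous, so timeouts,
retries, and abort handling are left out.

```python
import time
from typing import Callable


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # hypothetical default
        self.cooldown_s = cooldown_s                # hypothetical default
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def _admit(self) -> None:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open")
            # Cooldown elapsed: move to half-open and allow a single probe.
            self.state = "half-open"
            self.probe_in_flight = False
        if self.state == "half-open":
            if self.probe_in_flight:
                raise CircuitOpenError("probe already in flight")
            self.probe_in_flight = True

    def call(self, fn: Callable[[], object]) -> object:
        self._admit()  # rejections here are not counted as failures
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        self.state = "closed"
        self.failures = 0
        self.probe_in_flight = False

    def _on_failure(self) -> None:
        self.probe_in_flight = False
        self.failures += 1
        # A failed probe re-opens immediately; otherwise open at the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Sharing one instance at module level, as the _singleton note describes,
is what lets every client observe the same open or closed state.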

Worker vitest suite (B.6, staffml-vault-worker/tests/worker.test.ts):
6 tests: rate-limit under/over cap with Retry-After; schema-fingerprint
placeholder forces degraded mode; real fingerprint clears flag;
cursor filter_hash mismatch returns 400; CORS echoes allowed origin;
405 on POST/PUT/DELETE; /admin/release returns 404 (no auth footgun).

vault ship real hooks (B.15, commands/release.py):
- d1_forward: pnpm exec wrangler d1 execute <env-db> --file <migration.sql>
- d1_rollback: applies d1-rollback.sql (SQL path); snapshot path remains
  primary per §6.2.
- nextjs_forward: pnpm run deploy:<env> from site_dir.
- nextjs_rollback: pnpm exec wrangler pages deployment list (lets operator
  pick rollback target).
- paper_forward: git tag -a v<version> && git push origin v<version>.
- --skip-legs allows shipping subset (e.g., skip=paper for pre-tag validation).

Content-hash SLI workflow (B.5, .github/workflows/vault-content-hash-sli.yml):
Hourly GitHub Action samples 20 IDs from latest release's vault.db,
fetches same IDs from production worker, recomputes canonical content_hash
in Python, asserts parity. Files a priority-high issue on mismatch.
Avoids porting hashing.py canonicalization to TypeScript (Chip R3-H5's
invariant-bomb risk).
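
A rough sketch of the parity check follows. The canonicalization shown
(sorted-key compact JSON hashed with SHA-256) is only a stand-in, since
the real rules live in hashing.py and are not reproduced in this message.
The names sample_ids, find_mismatches, and fetch_live are hypothetical.

```python
import hashlib
import json
import random
from typing import Callable, Iterable, Optional


def content_hash(record: dict) -> str:
    # ASSUMPTION: placeholder canonicalization, not the hashing.py algorithm.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sample_ids(all_ids: Iterable[str], k: int = 20,
               seed: Optional[int] = None) -> list:
    """Pick up to k IDs; a fixed seed makes a failing run reproducible."""
    ids = sorted(all_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def find_mismatches(release: dict, fetch_live: Callable[[str], dict],
                    k: int = 20, seed: Optional[int] = None) -> list:
    """Return sampled IDs whose live record hashes differently from the release."""
    return [
        qid for qid in sample_ids(release.keys(), k, seed)
        if content_hash(fetch_live(qid)) != content_hash(release[qid])
    ]
```

In a scheduled job, a non-empty result would be the trigger for filing
the issue; computing both hashes in Python keeps one canonicalization
implementation rather than two.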

JSON schemas (B.7, vault-cli/docs/JSON_OUTPUT.md):
Full stable shapes for build, publish, ship, new, rm, move, renumber,
restore, promote, mark-exemplar, snapshot, migrations-emit, export-paper,
tag, deploy, rollback, generate. Plus notes for serve/api (not
JSON-emitting: long-running servers).

Codegen hash baseline (B.13 hash-check variant):
vault codegen --check now computes SHA-256 over 3 shared artifacts and
compares to committed interviews/vault-cli/codegen-hashes.txt. First run
auto-records baseline; subsequent runs enforce no drift. Full LinkML-driven
regeneration remains a Phase-2 follow-up. Baseline recorded this commit.
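
The record-then-enforce behaviour can be sketched as below. This is an
illustration, not the vault codegen implementation: the function names
and the "<digest>  <name>" baseline line format are assumptions, and the
sketch hashes each artifact separately, whereas the real command may
combine them differently.

```python
import hashlib
from pathlib import Path


def digest_lines(paths: list) -> list:
    """One '<sha256>  <filename>' line per artifact, in a stable order."""
    ordered = sorted(Path(p) for p in paths)
    return [f"{hashlib.sha256(p.read_bytes()).hexdigest()}  {p.name}"
            for p in ordered]


def check_codegen(paths: list, baseline: Path) -> bool:
    """Return True when artifacts match the baseline.

    If no baseline exists yet, record one and report success (first run).
    """
    current = digest_lines(paths)
    if not baseline.exists():
        baseline.write_text("\n".join(current) + "\n", encoding="utf-8")
        return True
    recorded = baseline.read_text(encoding="utf-8").splitlines()
    return recorded == current
```

Because the baseline file is committed, any later drift in a generated
artifact shows up as a failed check in CI rather than as a silent change.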

Component migration hook (B.17,
staffml/src/lib/hooks/useVaultQuestion.ts):
Minimal React hook that routes through corpus-source.ts. Components opt
into the cutover by importing from here; existing corpus.ts callers remain
untouched. Cutover-day swap is one import per component, not a big-bang
replacement.

28/28 pytest still green. release_hash 1b304282... unchanged (no
content-affecting mutations).
2026-04-16 14:04:03 -04:00
Vijay Janapa Reddi
6dff01c065 docs(vault): Phase 0 documentation deliverables
EVOLUTION.md (fixes H-1 from REVIEWS.md)
  Schema-version rules: SemVer semantics (additive-minor implicit,
  breaking-major bumps schema_version). Loader contract across
  versions. vault migrate-schema mechanics: parallel tree, forward/
  rollback functions, --dry-run, failure log. Mixed-version PRs
  forbidden — CI rejects. Canonicalization-version (CANON_VERSION)
  bumps separate from schema_version. Historical record stub.

EXIT_CODES.md
  Stable exit-code taxonomy table with rationale for each category
  (0 vs 1, 1 vs 2, 3 vs 4, 5 as user-abort). Usage in code, tests,
  JSON output. Evolution policy: add new codes, never renumber.

JSON_OUTPUT.md
  Common envelope: {ok, exit_code, exit_symbol, command,
  cli_version, data, errors, warnings}. Per-command schemas for
  check, stats, verify, doctor, diff. LSP-diagnostic shape for
  check errors. --json-schema meta-command prints per-command
  JSON Schema.
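
As an illustration of the common envelope, the sketch below builds one
in Python. Only the eight field names come from the list above; the
helper name, the rule that ok means exit code 0 with no errors, the
example exit symbols, and the version string are assumptions. The sample
error follows the general shape of an LSP diagnostic (range, severity,
message), with a hypothetical file path.

```python
import json
from typing import Any, Optional


def make_envelope(command: str, exit_code: int, exit_symbol: str,
                  data: Optional[Any] = None,
                  errors: Optional[list] = None,
                  warnings: Optional[list] = None,
                  cli_version: str = "0.0.0") -> dict:
    """Build the common envelope; key order mirrors the documented field list."""
    errors = list(errors or [])
    return {
        "ok": exit_code == 0 and not errors,  # ASSUMPTION: how ok is derived
        "exit_code": exit_code,
        "exit_symbol": exit_symbol,
        "command": command,
        "cli_version": cli_version,
        "data": data,
        "errors": errors,
        "warnings": list(warnings or []),
    }


# Hypothetical check failure reported in an LSP-diagnostic-like shape.
diagnostic = {
    "file": "questions/example.yaml",  # hypothetical path
    "range": {"start": {"line": 3, "character": 0},
              "end": {"line": 3, "character": 12}},
    "severity": 1,  # 1 = Error in the LSP severity scale
    "message": "missing required field",
}

failed = make_envelope("check", 1, "VALIDATION_FAILED", errors=[diagnostic])
print(json.dumps(failed, indent=2))
```

Keeping every command on one envelope lets scripts branch on ok and
exit_code first and only then inspect the command-specific data.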

CONTRIBUTING.md (fixes H-17)
  Quick-start path from clone → local site serving a question in
  ≤10min target. What can be contributed, workflow, PR review.
  Provenance-honesty rules. Author attribution via
  vault/contributors.yaml. Phase-by-phase scope of what works today
  vs what lands later.

All four are referenced directly from ARCHITECTURE.md sections.
2026-04-15 21:25:52 -04:00
Vijay Janapa Reddi
eaca50116a docs(vault): detailed testing plan and cutover QA checklist
TESTING.md fleshes out ARCHITECTURE.md §19 with concrete inventory:
- Test pyramid: unit / integration / CLI contract / data-migration /
  equivalence / codegen-drift / worker contract / E2E Playwright /
  smoke / load / rollback / SLI probes.
- Fixtures: 20-question frozen corpus, golden vault.db, 15 schema-
  drift fixtures covering each invariant class, cross-release fixtures
  for migrations.
- Per-layer test file inventory (every vault subcommand has a named
  contract test).
- CI workflow spec: .github/workflows/vault-ci.yml for every PR plus
  nightly and deploy workflows. PyYAML + Python pinned for hash
  stability.
- Phase-entry gate table: what testing artifacts block each phase
  transition.
- Observability + rollback protocol for Phase 4.

CUTOVER_QA.md: expanded from §19.4 into a sequential operator
runbook. Pre-cutover gates, vault ship canary stages, 8 flow checks
(home, practice, gauntlet, progress, about, search, chain UX,
offline), network/bundle verification, rollback drill (rehearsed
on staging first), 48h post-cutover watch cadence, rollback-trigger
conditions, post-cutover sign-off checklist.

Both docs are living — TESTING.md evolves as CLI surface grows;
CUTOVER_QA.md is versioned per release.
2026-04-15 18:07:27 -04:00