cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-07-16 23:24:55 -05:00

Author	SHA1	Message	Date
Vijay Janapa Reddi	f12d303769	chore(interviews): purge stale AI prompts and dev scratch from interviews/ Remove ten files from the public repo that should never have been tracked. Verified no code references any of them before deleting. AI-prompt files (private to author tooling, do not belong in the public repo): - interviews/vault-cli/docs/GEMINI_SELF_AUDIT_PROMPT.md - interviews/vault/_pipeline/runs/gemini-self-audit/prompts/{cloud, edge,global,mobile,tinyml}_audit_prompt.md (5 per-track prompts; interviews/vault/.gitignore already excludes /_pipeline/, but these five were force-added in `f6c41d7689` before the rule was set) Dev-scratch artifacts (clearly leftover dev iteration; filenames literally say 'final' four different ways): - interviews/vault-cli/check_results_absolute_final.json - interviews/vault-cli/check_results_after_repair.json - interviews/vault-cli/check_results_final.json - interviews/vault-cli/check_results_total_final.json No production code, tests, docs, or CI references any of these paths. The audit-pipeline scripts that would write into _pipeline/ already respect the existing gitignore rule for that directory tree.	2026-05-05 10:51:53 -04:00
Vijay Janapa Reddi	e465587959	docs(vault-cli): GEMINI_SELF_AUDIT_PROMPT.md — agentic audit via gemini CLI A self-contained prompt that lets gemini CLI walk the corpus and audit it directly via its own filesystem tools, without the audit_corpus_batched.py Python wrapper. Useful when the wrapper hits rate-limit / exit-55 walls or when the operator wants Gemini to checkpoint to disk as it goes. The prompt uses an append-only JSONL output at interviews/vault/_pipeline/runs/gemini-self-audit/01_audit.jsonl with resume semantics (re-running skips qids already in the file). Encodes the same five gates as audit_corpus_batched.py (format_compliance, level_fit, coherence, math_correct, title_quality) plus a stable JSON shape so downstream tooling can consume it identically. Includes invocation guidance: --yolo + --skip-trust, slice by track to avoid the multi-hour serial walk, resume across sessions.	2026-05-04 10:36:31 -04:00
Vijay Janapa Reddi	ac2c7b39eb	docs(vault-cli): PHASE_5_UNRESOLVED.md — post-drain accounting Reflects the 2026-05-04 follow-up slices: math-skip-level (15 applies) and math-finish queue drain (66 applies). Cumulative now 2,372 of 2,757 (86.0%); 385 known-deferred ahead of Phase 6. Also corrects the original doc's '70 already-applied no-ops' line — those were unverified math candidates the verify guard skipped, not no-ops.	2026-05-04 08:14:16 -04:00
Vijay Janapa Reddi	2dc556e1e5	docs(vault-cli): PHASE_6_HANDOFF.md — resume guide after Phase 5 mass-apply Self-contained resume guide for the next session: - Confirms Phases 0-5 (autonomous) + 8 done - Documents 478 unresolved corrections (cross-refs PHASE_5_UNRESOLVED) - Step-by-step for Phase 5 cleanup → Phase 6 schema → Phase 7 verify → Phase 9 release - Concrete CLI commands for each step (vault audit review with --filter-gate flags, vault codegen, vault publish) - Reference doc map (which doc covers what) - Pipeline data layout (where the canonical 01_audit.json lives) - Full commit log from this session - Merge command to land yaml-audit on dev when ready - Paste-ready resume prompt for the next Claude Code session Total estimated remaining work to ship vault 1.0.0: ~9h, mostly Phase 5 review + Phase 6 schema. Tree is clean; ready to hand off.	2026-05-04 07:14:47 -04:00
Vijay Janapa Reddi	79b4c3361e	docs(vault-cli): PHASE_5_UNRESOLVED.md — list of corrections needing human review After the autonomous Phase 5 mass-apply + math-verify passes, 2,279 of 2,757 corrections (82.6%) were auto-applied. The remaining 478 were deliberately not applied because they fail one of three safety checks: 75 math 'no' — independent Gemini check disputed the fix 14 math 'unclear' — Gemini wasn't confident 13 math + level-block — fix has level relabel that breaks a chain 168 relabel-up — against CORPUS_HARDENING_PLAN.md §10 Q3 138 chain-block — would break chains.json monotonicity 70 already-applied — no action needed This doc: - Summarizes the skip reasons + counts - Points to the disposition logs in _pipeline/runs/ - Recommends a per-category review workflow - Notes which categories are highest priority (math 'no') - Notes which are chain-restructuring decisions (out of Phase 5 scope) Reviewer flow uses `vault audit review` (apply_corrections.py wrapper) with --filter-gate to target specific buckets. Phase 5 autonomous portion is COMPLETE. Phase 6 (schema tightening) remains safe to attempt once the 478 are dispositioned or accepted as known-deferred.	2026-05-03 19:17:46 -04:00
Vijay Janapa Reddi	9ee3c34303	docs(vault-cli): PHASE_4_HANDOFF — update post-backfill Append 2026-05-03 update reflecting: - Phase 4 backfill complete (2,757 corrections proposed; 0 errors) - 6 cloud questions migrated (stray top-level MCQ fields → details) - Phase 8 CLI subcommand shipped (vault audit run/review/summarize/merge) Next session can skip Step 1 (backfill — done) and start at Step 2 (Phase 5 interactive review).	2026-05-03 18:39:37 -04:00
Vijay Janapa Reddi	87481ab6a3	docs(vault-cli): refresh AUDIT_FINDINGS_2026-05-03 after Phase 4 backfill After running --propose-fixes backfills on cloud + edge failures, the canonical merged audit dataset now has: - 9,446 questions audited (100%) - 0 errors (all retried clean) - 2,757 with suggested_corrections (up from 1,767; +990 new fixes from cloud + edge backfills) Per-track: cloud: 4,028 / 851 with fixes edge: 2,079 / 669 global: 313 / 90 mobile: 1,824 / 677 tinyml: 1,202 / 470 Phase 5 (apply_corrections.py interactive review) can now begin on the 2,757-row subset. CORPUS_HARDENING_PLAN.md Phase 4 backfill complete.	2026-05-03 18:38:55 -04:00
Vijay Janapa Reddi	87adaeec2f	docs(vault-cli): PHASE_4_HANDOFF.md — resume guide for next session Self-contained instructions for picking up where Phase 4 left off: - Sanity checks for the worktree (vault check, pytest, ruff) - Phase 4 backfill steps (cloud + edge --propose-fixes, retry global errors, re-merge, regenerate AUDIT_FINDINGS) - Phase 5 review workflow (apply_corrections.py with --filter-gate + --auto-accept-format for low-risk fixes) - Phase 6 schema-tightening checklist (LinkML pattern constraints, Details extra="forbid", lift gate into validator) - Phase 7 title verification - Phase 8 vault audit CLI subcommand (cron is already shipped) - Phase 9 release pipeline Includes: - Concrete CLI commands for every step - Cost estimates per step - Tooling reference (which script does what) - Open questions from CORPUS_HARDENING_PLAN.md §10 still to decide - Full commit log from this session - Troubleshooting (rate limits, codespell, scratch files) Total estimated time to ship vault 1.0.0: ~12h, mostly Phase 5 human review. Spread over 2-3 working days. CORPUS_HARDENING_PLAN.md Phase 4 → Phase 5 transition.	2026-05-03 14:31:45 -04:00
Vijay Janapa Reddi	d2621cc9ed	feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc merge_audit_runs.py — merges multiple per-track audit_corpus_batched output dirs into one canonical run. Per-qid prefer non-error rows, then rows with suggested_corrections. AUDIT_FINDINGS_2026-05-03.md — first complete corpus audit. summarize_audit.py — truncate rationale snippets at word boundaries (was truncating mid-word, tripping codespell on words like 'claimin'). Phase 4 final stats (9,446 published questions audited): format_compliance: ~960 fail level_fit: ~1,580 fail coherence: ~480 fail math_correct: ~330 fail title_quality: ~250 placeholder + ~25 malformed 20 error rows in global to retry on next run 1,767 questions have suggested_corrections; ~1,500 more need a propose-fixes backfill pass (mostly cloud, some edge). CORPUS_HARDENING_PLAN.md Phase 4 finalization.	2026-05-03 14:26:37 -04:00
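The two behaviours this commit describes (per-qid row preference when merging runs, and truncating rationale snippets at a word boundary) can be sketched as follows. This is a minimal illustration and not the actual merge_audit_runs.py or summarize_audit.py code: the row keys `error` and `qid`, the helper names, and the `...` suffix are assumptions made for the example.

```python
from typing import Iterable


def row_rank(row: dict) -> tuple[int, int]:
    """Higher sorts better: non-error rows first, then rows carrying fixes.

    The "error" key is a hypothetical marker for a failed audit call.
    """
    is_clean = 0 if row.get("error") else 1
    has_fix = 1 if row.get("suggested_corrections") else 0
    return (is_clean, has_fix)


def merge_runs(runs: Iterable[Iterable[dict]]) -> dict[str, dict]:
    """Keep one row per qid, preferring the best-ranked row across all runs."""
    best: dict[str, dict] = {}
    for run in runs:
        for row in run:
            qid = row["qid"]
            if qid not in best or row_rank(row) > row_rank(best[qid]):
                best[qid] = row
    return best


def truncate_at_word(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the cut lands mid-word, back up to the last space before it.
    if not text[limit].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + "..."
```

On ties the first row seen wins, so the order in which run directories are passed matters; the real script may break ties differently.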
Vijay Janapa Reddi	36f2ef5929	docs(vault-cli): CORPUS_HARDENING_PLAN.md — supersedes RELEASE_AUDIT_PLAN.md End-to-end plan for taking the published-corpus audit from "stratified sample at ~2,900 calls / 12 days" to "full corpus at ~450 calls / ~3 days". The previous plan over-budgeted by 6× because it assumed 1-call-per-gate-per-question; switching to batched 30-questions-per-call collapses the cost. Nine phases, 27 testable acceptance criteria. End state: every published YAML conforms to a strict schema with load-time-enforced format markers (Pitfall/Rationale/Consequence + Assumptions/Calculations/Conclusion); math, level-fit, coherence, vendor fabrication, and physical realism are independently Gemini-verified at corpus scale; new violations are caught at vault check --strict time and cannot silently land. Major design choices: - Audit + corrections in one tool (audit_corpus_batched.py), with a --propose-fixes mode whose suggestions are NEVER auto-applied — humans review via apply_corrections.py. - Schema tightening AFTER cleanup, not before (Phase 6 lifts pattern constraints into LinkML / Pydantic only once Phase 5 has cleaned the corpus, so the new constraints reject nothing real). - Cron the audit (Phase 8) so findings become a routine artifact. - AUTHORING.md + vault new scaffold (Phase 2) so new contributors see the format conventions before authoring, not after CI catches them.	2026-05-03 07:43:47 -04:00
Vijay Janapa Reddi	963fbfb162	docs(vault-cli): RELEASE_AUDIT_PLAN.md — handoff for fresh-session corpus audit Captures the release-readiness state of the vault and the plan for finishing the audit work the 250/day Gemini cap has constrained. Corpus health survey (9,446 published questions, no Gemini cost): - 100% schema-valid (Pydantic) - 90.9% format-compliant (Pitfall/Rationale/Consequence + Assumptions/ Calculations/Conclusion markers) - 9.1% fail format compliance (861 questions; mechanical fixes) - 134 placeholder titles (all global/* "Global New NNNN") - 407 with provenance: None (should be "imported") - 95.3% canonical bold-marker napkin_math; 4.7% partial / bullet-only Template gap noted: vault new scaffolds only scenario + solution stubs; the Pitfall/Rationale/Consequence and Assumptions/Calculations/Conclusion templates are encoded ONLY in the generation prompt and the format-compliance regex. There's no human-readable AUTHORING.md. The new session is asked to ship one. The plan: stratified sample of 1,000 questions (33 per track × level cell) with full Gemini gate suite (math + coherence + level_fit + bridge) at ~2,900 calls across ~12 days at the 250/day cap. Full-corpus audit (~27,400 calls / ~110 days) is infeasible; sampling captures any failure mode at >5-10% rate. Includes: - Concrete numbers from the corpus survey (failure counts by category) - Day-by-day execution plan with resume instructions - Daily cost-ledger format - Stopping rules - Post-audit cleanup → paper.tech update path - Mechanical (no-Gemini) cleanups the new session can do in parallel with the daily audit cycle (provenance fix, format markers, AUTHORING.md) CHAIN_ROADMAP.md Progress Log entry points the resumable cursor at this plan.	2026-05-02 11:29:57 -04:00
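The day counts quoted in this plan follow directly from the 250/day cap. A small sketch of that arithmetic, using only the figures stated in the commit message (the function name is hypothetical):

```python
import math

DAILY_CALL_CAP = 250  # Gemini cap quoted in the commit message


def days_at_cap(total_calls: int, cap: int = DAILY_CALL_CAP) -> int:
    """Whole days needed when at most `cap` calls can be issued per day."""
    return math.ceil(total_calls / cap)


sample_days = days_at_cap(2_900)   # stratified sample of 1,000 questions
full_days = days_at_cap(27_400)    # full-corpus audit, one call per gate
print(sample_days, full_days)
```

The results match the "~12 days" and "~110 days" estimates in the message.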
Vijay Janapa Reddi	2b3cf5e1da	chore(vault): consolidate AI pipeline artifacts under _pipeline/ Establishes one ignored subdirectory for ALL intermediate outputs of LLM-driven tooling (chain proposals, gap detection, draft scorecards, audit traces). Single gitignore rule: /_pipeline/. Convention is documented in interviews/vault/README.md under "Pipeline artifacts" — it's a real project layout convention, not AI-specific config. Path migration: interviews/vault/chains.proposed.json → _pipeline/chains.proposed.json interviews/vault/gaps.proposed.json → _pipeline/gaps.proposed.json interviews/vault/draft-validation-scorecard.json → _pipeline/draft-validation-scorecard.json interviews/vault/audit-runs/ → _pipeline/runs/ 8 scripts updated to define a PIPELINE_DIR constant and route default outputs through it: build_chains_with_gemini.py, apply_proposed_chains.py, merge_chain_passes.py, validate_drafts.py, audit_chains_with_gemini.py, generate_question_for_gap.py, summarize_proposed_chains.py, promote_drafts.py. Forward-looking docs (README.md chain-pipeline section + CHAIN_ROADMAP.md resume instructions + state snapshot) updated to reference the new paths. Historical Progress Log entries left as-is — they accurately describe what was committed at the time. Drive-by .gitignore fixes (both used full repo-relative paths under package-local .gitignore files, which never matched): interviews/vault-cli/.gitignore: scripts/.calibration_cache/ interviews/vault/.gitignore: /embeddings.npz Validation: - vault check --strict: 10,705 loaded, 0 invariant failures - pytest interviews/vault-cli/tests/: 74/74 - audit --dry-run: paths resolve correctly to _pipeline/runs/<ts>/ No durable corpus content moves. chains.json (live registry), id-registry.yaml, questions/, etc. all stay where they were.	2026-05-02 09:04:55 -04:00
Vijay Janapa Reddi	270b1a5bd2	fix(vault): drop 55 Δ=0 chains + remove Δ=0 from lenient mode Action on the strongest finding from the 2026-05-01 independent audit: 54 of 55 Δ=0 chains had no shared scenario (the "two questions sharing a scenario thread" constraint the lenient prompt was supposed to enforce). Two independent audit fields agreed (verdict=bad and shared_scenario=no), so this isn't a tuning question — the design choice was wrong. Why remove Δ=0 entirely rather than tighten the prompt: - The chain definition is "pedagogical progression through Bloom levels"; same-level edges contradict the definition. - The "shared scenario / different angle" carve-out is unenforceable by an LLM at corpus scale (audit confirmed). - Same-scenario same-level pairs are more honestly modeled as siblings of a chain anchor, not as chain members. Changes: - chains.json: 879 → 824. Dropped: 55 chains (all tier=secondary, since Δ=0 was only ever produced by the lenient sweep). Per-track: edge -19, tinyml -12, mobile -10, cloud -7, global -7. - build_chains_with_gemini.py: MODE_CONFIG["lenient"]["allowed_deltas"]: {0,1,2,3} → {1,2,3} LENIENT_PROMPT_TEMPLATE: Δ=0 paragraph rewritten to explicitly REJECT same-level pairs (with rationale citing the audit). docstring + --mode help text updated. - tests/test_chain_validation.py: test_lenient_accepts_same_level_pair → test_lenient_rejects_same_level_pair header docstring updated to reflect the new rule. - vault-manifest.json: chainCount 879 → 824, releaseHash rolls to 479811040b7a… (real content delta, not a timestamp churn). Validation: - vault check --strict: 10,705 loaded, 0 failures - vault build --local-json: chainCount=824, releaseHash=479811040b… - pytest: 74/74 - playwright chain-and-vault-smoke: 19/19 (fixtures cloud-0001 + cloud-0231 are still in their chains post-drop) Audit findings #2 (gap detection ~50% noise) and #3 (4 pilot drafts disposition) remain open — see CHAIN_ROADMAP.md Progress Log.	2026-05-02 08:51:49 -04:00
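The rule change in this commit (lenient mode accepting only Δ ∈ {1,2,3}) can be sketched as a check over consecutive level deltas. This assumes levels are represented as integers and is not the validator in build_chains_with_gemini.py; the function names are hypothetical.

```python
# Allowed level deltas for lenient mode after this commit (was {0, 1, 2, 3}).
LENIENT_ALLOWED_DELTAS = frozenset({1, 2, 3})


def level_deltas(levels: list[int]) -> list[int]:
    """Difference between each consecutive pair of levels in a chain."""
    return [later - earlier for earlier, later in zip(levels, levels[1:])]


def chain_is_valid(levels: list[int],
                   allowed: frozenset[int] = LENIENT_ALLOWED_DELTAS) -> bool:
    """A chain needs at least two members and only allowed deltas."""
    if len(levels) < 2:
        return False
    return all(delta in allowed for delta in level_deltas(levels))
```

Under this sketch a same-level pair such as levels `[3, 3]` is rejected, which is the behaviour the renamed test `test_lenient_rejects_same_level_pair` is described as asserting.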
Vijay Janapa Reddi	b68f6dbf83	audit(vault): independent Gemini audit — 18 calls, 3 critical findings Ran audit_chains_with_gemini.py end-to-end. 18 Gemini-3.1-pro-preview calls (well under the 250/day cap) sized to 80-336K char prompts (the attention sweet spot at ~80-100K input tokens). Per-call traces under interviews/vault/audit-runs/20260501T213817Z/, rollup at interviews/vault/audit-runs/AUDIT_REPORT.md. Three critical findings the pipeline's own gates missed: 1. Δ=0 chains are ~98% bad (54/55 judged "bad", 54/55 judged "shared_scenario_for_d0_pair: no"). The lenient prompt's constraint that Δ=0 only fire for shared-scenario pairs didn't bind in practice. 6% of chains.json is affected. 2. Gap detection is ~50% noise. 21 of 40 sampled gaps judged "hallucinated" — anchors don't share a scenario thread. Phase 3 generation should pre-filter gaps before issuing the call. 3. Pilot draft pass rate was inflated by validate_drafts.py's LLM judges: mobile-2147 accept edge-2536 edit (scenario truncation) edge-2537 REJECT (cognitive load too low for L3) mobile-2146 REJECT (physically absurd 0.5s/4W NPU wake-up) Calibration findings: - Primary chains (n=100): 64% good, 22% weak, 14% bad - Secondary chains (n=100): 61% good, 33% weak, 6% bad - Tier delta vs primary is small at "good" — the actual quality cliff in secondary is concentrated in the Δ=0 subset. No autonomous fixes filed — per agreement, audit produces findings only. CHAIN_ROADMAP.md Progress Log spells out the three concrete decisions for next session (drop / demote / rebuild Δ=0; pre-filter gaps; disposition the 4 drafts per AUDIT_REPORT.md). Total Gemini calls this session: 55 (Phase 1.4 + Phase 3 pilot + audit).	2026-05-01 18:04:36 -04:00
Vijay Janapa Reddi	bc553017b4	docs(vault): roadmap status + Phase 3 authoring conventions D-cleanups folded into one commit: - CHAIN_ROADMAP.md status header reflects current state (Phase 1+2 complete, Phase 3 pilot landed, Phase 4 mostly shipped). - Phase 4.1 / 4.6 / 4.7 / 4.9 entries marked complete with commit refs. - ARCHITECTURE.md gains a §3.6.1 documenting the two YAML-body conventions introduced when LLM-authored questions started landing in Phase 3: - _authoring private metadata block on drafts (stripped at promotion) - gap-bridge:<from>-<to> tag added at promotion for traceability Neither is schema-enforced (Pydantic accepts extra); both are stable across the pipeline. No code changes.	2026-05-01 17:33:36 -04:00
Vijay Janapa Reddi	de46921cfe	docs(vault-cli): PHASE_3_REVIEW_GUIDE.md — human review handoff Walkthrough for reviewing LLM-authored question drafts produced by generate_question_for_gap.py + validate_drafts.py. Covers: - what each of the 5 gates catches and (critically) misses - what to read in what order, with watchpoints for the failure modes that LLM gates routinely let through (vendor-name fabrication, arithmetic drift, level-stamping mismatches) - decision tree: promote (publish vs draft), edit + retry, reject - exact promote_drafts.py invocations for each path - rough scorecard summary for the 4 pilot drafts shipped in a750ab7bc, ready for the user's review pass Designed for ~10-15 min of reading per pilot batch.	2026-05-01 17:24:07 -04:00
Vijay Janapa Reddi	085bf15861	docs(vault-cli): catch up to --legacy-json → --local-json rename dev renamed the vault-cli flag in `2b381bb949` (the flag is the staffml frontend's local-dev fallback for reading corpus.json from disk, not deprecated path — "local-json" reads correctly in scripts and docs). Merge of origin/dev (5c5af75ed) brought the new name in but the roadmap + README still referenced the old one. - README.md: 1 replacement in the chain-pipeline runbook footer - CHAIN_ROADMAP.md: 8 replacements across resume instructions, phase runbooks, and progress-log validator lines Historical text inside log entries is otherwise unchanged — those record what was true at commit time. Forward-looking instructions now use the current flag name.	2026-05-01 17:13:11 -04:00
Vijay Janapa Reddi	bf70e7686f	feat(vault): Phase 3 pilot — 5 gaps generated, 4 promoted as drafts Pilot run of the Phase 3 authoring tooling on a 5-gap subset (sized down from the roadmap's 30 to keep wall-time + Gemini-call budget reasonable for an unsupervised run). Pilot scope: Selected 5 high-value gaps from gaps.proposed.lenient.json — buckets with ≥4 published questions, biased toward low-density tracks. All 5 picks landed in edge/mobile. Phase 3.c — generate (5/5 written): edge-2535 edge/latency-decomposition L?→L3 edge-2536 edge/pruning-sparsity L?→L4 edge-2537 edge/tco-cost-modeling L?→L3 mobile-2146 mobile/duty-cycling L?→L3 mobile-2147 mobile/model-format-conversion L?→L2 Phase 3.b validation — 4/5 pass (80% — above roadmap's 60-75% target): edge-2535: FAIL on originality (cos=0.933 vs edge-1883, threshold 0.92) edge-2536: pass on all 4 gates edge-2537: pass on all 4 gates mobile-2146: pass on all 4 gates mobile-2147: pass on all 4 gates The originality gate correctly caught a draft that was too similar to one of its bridge anchors — exactly the failure mode it was designed for. Gates were run on schema (Pydantic), originality (BAAI/bge-small-en-v1.5 cosine vs in-bucket neighbours, threshold 0.92), level_fit (Gemini-judge against same-level exemplars), coherence (Gemini-judge), and bridge (Gemini-judge against the gap anchors). 
Phase 3.d — promotion (4 passing drafts): - .yaml.draft → .yaml rename - _authoring stripped; replaced with proper schema fields: provenance: llm-draft status: draft (NOT published — gating on human review) authors: [gemini-3.1-pro-preview] human_reviewed: { status: not-reviewed } tags: + gap-bridge:<from>-<to> - id-registry.yaml appended (append-only ledger preserved) - edge-2535.yaml.draft kept in place for the human reviewer's disposition (rewrite + retry vs delete) Validation post-promotion: - vault check --strict: 10,705 loaded (was 10,701; +4 ✓), 0 failures - vault build --legacy-json: released set unchanged (status=draft excluded by release-policy.yaml's published filter) — releaseHash and chainCount intentionally stable until human review flips status Phase 3.e (chain rebuild) deferred: drafts must clear human review and flip to status: published before they're eligible for chain membership. Runbook in CHAIN_ROADMAP.md Progress Log. Cost: 5 generation + 15 judge = 20 Gemini calls.	2026-05-01 13:38:18 -04:00
Vijay Janapa Reddi	604869b986	feat(vault-cli): Phase 3.a + 3.b — gap-driven authoring tooling Two new scripts that together close the loop from a gap entry to a reviewable candidate question with a multi-gate scorecard. generate_question_for_gap.py (3.a): - Reads a gap entry, loads between-questions + same-bucket exemplars, prompts gemini-3.1-pro-preview, runs Pydantic Question validation, and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps drafts out of vault check / vault build until promotion. - ID allocator scans corpus + existing drafts so a batch run gets distinct fresh IDs without touching id-registry.yaml. - Modes: --gap-index, --gaps-from + --limit, --dry-run. validate_drafts.py (3.b): - Five gates per draft: schema (Pydantic), originality (cosine vs in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus embeddings.npz so values are comparable; cutoff 0.92), level_fit (Gemini-judge against same-level exemplars), coherence (Gemini-judge: scenario/question/solution consistency), and bridge (Gemini-judge: chain-fit between the gap's two anchors). - Final verdict pass iff every non-skipped gate passes. - Skips: --no-originality, --no-llm-judge. - Output: interviews/vault/draft-validation-scorecard.json. Smoke checks: - 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates cloud-4579. Synthetic Gemini response Pydantic-validates clean. - 3.b on a synthetic /tmp draft: schema + originality pass (top neighbour cosine 0.73 vs 0.92 threshold). Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML question content that needs human review before promotion. The tooling ships ready; running it is a user-supervised step. CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.	2026-05-01 11:31:06 -04:00
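The originality gate described above compares a draft's embedding with its in-bucket neighbours and applies a 0.92 cutoff. A minimal pure-Python sketch, assuming the embeddings are already computed as plain float lists and that a similarity at or above the cutoff fails the gate (the commit does not say whether the comparison is strict); function and key names are hypothetical.

```python
import math

ORIGINALITY_THRESHOLD = 0.92  # cutoff quoted in the commit message


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def originality_gate(draft: list[float],
                     neighbours: dict[str, list[float]],
                     threshold: float = ORIGINALITY_THRESHOLD) -> dict:
    """Find the nearest neighbour and pass only if it is below the threshold."""
    nearest_id, nearest_sim = None, -1.0
    for qid, vec in neighbours.items():
        sim = cosine(draft, vec)
        if sim > nearest_sim:
            nearest_id, nearest_sim = qid, sim
    # Assumption: similarity >= threshold counts as "too similar".
    return {"pass": nearest_sim < threshold,
            "nearest": nearest_id,
            "cosine": nearest_sim}
```

The smoke check in the message (top neighbour cosine 0.73 against the 0.92 threshold) would pass under this rule.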
Vijay Janapa Reddi	bff166bb9b	docs(vault-cli): roadmap log — 4.2 audit (no-op) + 4.8 ship + status - 4.2: audited multi-chain memberships; 0 qids in >1 chain because the lenient sweep was scoped to uncovered buckets (no overlap with primary). Deferred the focused playwright test until Phase 3 authoring makes the case live. - 4.8: marked complete; cross-ref to f086b6f42. - Header timestamp + status snapshot updated.	2026-04-30 20:27:24 -04:00
Vijay Janapa Reddi	9680e8e9fd	feat(vault+staffml): Phase 2 — tier surfacing, schema → TS → UI Carries the primary/secondary chain tier (from Phase 1) through the build pipeline into the practice + explore surfaces, so primary chains are the unmarked default and secondary chains are an opt-in alternative path the user can deep-link into via ?chain=<id>. Backend (2.1): - legacy_export.py emits chain_tiers per question alongside chain_ids and chain_positions; missing chain-tier defaults to "primary". - vault build re-run: 2953 chained questions, all carry chain_tiers (releaseHash unchanged — new field is additive, doesn't perturb the manifest hash inputs). - Existing legacy_export tests were stale (asserted on the v1.0 YAML chains: field path; v1.1 made chains.json the sidecar source). Rewrote them to write chains.json fixtures into tmp_path and added chain_tiers assertions, plus a focused test_chain_tiers_emitted_per_membership case. TypeScript (2.2): - Question.chain_tiers? (Record<string, "primary"|"secondary">) - ChainTier export, ChainInfo.tier required. - getChainForQuestion / getAllChainsForQuestion populate tier; getAllChains... sorts primary first. - New getPrimaryChainForQuestion(qid) helper for default surfaces. UI (2.3): - practice page reads ?chain=<id> URL param; defaults to getPrimaryChainForQuestion when unset. - ChainBadge gains an inline "alt path" pill when tier=secondary (always visible — no click needed). - ChainStrip mirrors that pill in the progress row for users who expand the strip. - Explore page prefers the first non-secondary chain when picking activeChainId for the related-questions panel. - Deferred to a follow-up commit (intentional, scoped via Progress Log): explore-page "Primary only / All" filter; daily/mock routing. Tests (2.4): - test7_tier_aware_chain_routing in chain-and-vault-smoke.mjs: secondary reachable via ?chain=, alt-path badge visible on secondary, primary regression, alt-path badge ABSENT on primary. 
- Full smoke suite: 17/17 pass (was 13/13). Validation: - vault check --strict: 10,701 loaded, 0 failures - vault build --legacy-json: 9438 published, chainCount=879 - pytest interviews/vault-cli/tests: 74/74 - npx tsc --noEmit: 0 errors - playwright chain-and-vault-smoke: 17/17 Phase 2 complete. Next: Phase 3 (gap-driven authoring; 407-gap backlog).	2026-04-30 20:22:54 -04:00
Vijay Janapa Reddi	83fe0f7193	feat(vault): Phase 1 — second-pass chain coverage build (373 → 879) Diagnoses uncovered (track, topic) buckets and runs a relaxed Gemini sweep targeting them. New chains tier="secondary"; pre-existing chains backfilled tier="primary". Tools (Phases 1.1, 1.2/1.3, 1.5): - diagnose_chain_coverage.py: surface buckets with no chains (committed earlier on yaml-audit) - build_chains_with_gemini.py: --mode lenient adds Δ ∈ {0,1,2,3} (committed earlier on yaml-audit) - merge_chain_passes.py: merges primary + secondary, enforces the multi-membership cap (max 2 chains/qid; non-L1/L2 capped at 1) Sweep (Phase 1.4): - 17 Gemini-3.1-pro-preview calls, ~22 min wall time, 211 buckets - 506 chains accepted (above the 200-400 estimate), 269 new gaps - validator caught a few cross-bucket and Δ=4 hallucinations inline - Δ distribution: Δ=1 69.1%, Δ=2 21.1%, Δ=3 4.6%, Δ=0 5.2% (10.9% of chains contain at least one Δ=0 — within target band) - random spot-check of 5 Δ=0 chains: all share scenario threads (DMA, CMSIS-NN, on-device routing, PB-scale pipelines) Coverage gains (chains/topic before → after): - cloud 2.95 → 4.37 (242 + 116 secondary) - edge 0.64 → 2.59 ( 49 + 148 secondary) - mobile 0.74 → 2.56 ( 46 + 113 secondary) - tinyml 0.80 → 2.64 ( 36 + 83 secondary) - global 0.00 → 0.96 ( 0 + 46 secondary) Buckets with ≥1 chain: 102 / 313 (33%) → 285 / 313 (91%). Validation: - apply_proposed_chains.py --dry-run: validation clean (879 chains) - vault check --strict: 10,701 loaded, 0 invariant failures - vault build --legacy-json: chainCount 373 → 879, release_hash rolled to 04ee8a23… - playwright chain-and-vault-smoke.mjs: 13/13 pass Phase 1 complete. Next: Phase 2 (tier surfacing in staffml UI).	2026-04-30 20:12:27 -04:00
Vijay Janapa Reddi	af5f25f543	feat(vault-cli): diagnose_chain_coverage.py — surface buckets needing chains Loads the published corpus (via vault_cli.policy — single source of truth) and chains.json, buckets by (track, topic), and emits chain-coverage.json with two cuts: - uncovered_buckets: ≥3 questions, 0 chains - under_covered_buckets: ≥6 questions, ≤1 chain Plus per-track summary + top-10 uncovered for quick read. Output is gitignored — regeneratable, fed to Phase 1.4's --buckets-from. Phase 1.1 of CHAIN_ROADMAP.md. See progress log for the run results (211 uncovered buckets, edge/mobile/tinyml chain density 0.6-0.8 vs cloud's 2.95, biggest miss is cloud:roofline-analysis at 144q/0 chains).	2026-04-30 18:15:59 -04:00
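The two coverage cuts named in this commit can be sketched as a bucket count over (track, topic). This is an illustration under assumed input shapes (lists of dicts with `track` and `topic` keys) and not the code in diagnose_chain_coverage.py.

```python
from collections import Counter


def diagnose(questions: list[dict], chains: list[dict]) -> dict:
    """Bucket by (track, topic) and apply the two thresholds from the commit."""
    q_count = Counter((q["track"], q["topic"]) for q in questions)
    c_count = Counter((c["track"], c["topic"]) for c in chains)
    # uncovered: at least 3 questions and no chains at all
    uncovered = sorted(b for b, n in q_count.items()
                       if n >= 3 and c_count[b] == 0)
    # under-covered: at least 6 questions and at most 1 chain
    under_covered = sorted(b for b, n in q_count.items()
                           if n >= 6 and c_count[b] <= 1)
    return {"uncovered_buckets": uncovered,
            "under_covered_buckets": under_covered}
```

Read literally, a bucket with six or more questions and zero chains satisfies both cuts; whether the real script reports it in both lists is not stated in the message.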
Vijay Janapa Reddi	671b37b37b	docs(vault-cli): CHAIN_ROADMAP.md — resumable plan for chain coverage workstream Canonical document for the multi-phase chain growth plan. Future Claude sessions read this first to resume exactly where the previous session left off. Structure: - Resume Here: how to verify state + pick up the next step - Current state snapshot: validators, counts, branch tip - Phase 1: second-pass coverage build (373 -> ~700 chains) - Phase 2: tier surfacing in schema + UI - Phase 3: gap-driven authoring (using gaps.proposed.json) - Phase 4: misc parallel items (CI gates, multi-chain UI, etc.) - Recommended execution order over 4 weeks - Progress Log: append-only notes after each step Initial Progress Log entry captures session state through commit `1ac7d4c56` (Gemini chain rebuild applied, 373 chains live).	2026-04-30 17:38:13 -04:00
Vijay Janapa Reddi	c824ac6ed1	refactor(staffml): retire prod static-fallback; opt-in dev-only (#1598) The bundled corpus.json was serving as a prod safety net behind the Cloudflare Worker. Post-cutover the Worker has been the real data source, and the static path was silently degrading rather than helping (corpus.json is a generated artifact whose prose `details` are blank in corpus-summary.json). This change: - Stops emitting corpus.json in the publish-live workflow - Removes the Worker-error fallback in getQuestionFullDetail — errors now propagate to useFullQuestion and the UI shows a "details unavailable" banner instead of silently filling blanks - Drops the localhost auto-trigger in shouldUseStaticDetails — the static path now requires explicit NEXT_PUBLIC_VAULT_FALLBACK=static - Switches taxonomy.ts to corpus-summary.json (was corpus.json) - Rewrites the publish-live smoke tests against corpus-summary.json - Collapses validate-vault.py to sparse-only (per-question deep validation lives in `vault check --strict`) Static-fallback remains as an OPT-IN local-dev affordance: set NEXT_PUBLIC_VAULT_FALLBACK=static and run `vault build --legacy-json` to materialize corpus.json. The Function-constructor dynamic import keeps Turbopack from requiring corpus.json at build time. useFullQuestion hook signature changed from `Question | undefined` to `{ question, status }`. Callers updated: practice and plans pages (both render an amber "details unavailable" banner when status is 'error'). Deleted dead cutover scaffolding: corpus-source.ts (router with no UI consumers), corpus-vault.ts (worker-only mirror, never wired up), useVaultQuestion.ts (unused migration hook), vault-fallback.ts (only consumer was corpus-source.ts). Deleted stale docs: staffml/scripts/DEPRECATED.md, vault-cli/docs/ CUTOVER_QA.md, three vault/docs/RESUME_PLAN_*.md. Verified locally: tsc clean, vitest 37/37, next build produces all 15 static routes. 
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:47:03 -04:00
Vijay Janapa Reddi	5131cb28fc	docs: R11 stability cleanup + v2.6 — 11 rounds, convergence declared R11 (David, fresh-eyes stability check): 0 Critical + 0 High + 1 Medium (doc cleanup from R10-F-2 closure itself). R11-M-1 (MEDIUM): CUTOVER_QA.md + vault-cli/README.md still referenced --canary-percent flag after R10-F-2 removed it from code + ARCHITECTURE.md. Operator following CUTOVER_QA.md step 1 of cutover day would hit 'Error: no such option --canary-percent' — the one document whose entire purpose is cutover correctness. Fix: CUTOVER_QA.md §1 replaces canary-staged rollout with all-or-nothing ship language + Phase-7-deferred note pointing at §4.3. README.md:57 drops [--canary-percent N] from the ship example. STABILITY DECLARED after R11. Three consecutive rounds (R7, R8, R11) with zero new Criticals. R11 explicit: 'convergence confirmed.' Finding-density trajectory across 11 rounds (new Criticals per round): R1: 3, R2: 1, R3: 2, R4: 3, R5: 3, R6: skipped, R7: 0, R8: 0, R9: 1* (regression-detect, not new), R10: 0, R11: 0 Total findings closed across all rounds: ~120. No further rounds scheduled. ARCHITECTURE.md header bumped v2.5 → v2.6. REVIEWS.md adds 'Rounds 7–11' section with per-round finding counts, notable findings, meta-observation on R9 (tooling/persistence issue Gemini caught that individual-file reviewers couldn't), and the convergence signal.	2026-04-16 16:42:39 -04:00
Vijay Janapa Reddi	f25f9e8184	feat(vault): B.1-B.7 + B.13 + B.15 + B.17 - finish bucket B Worker hardening (interviews/staffml-vault-worker/src/index.ts rewritten): - B.1 Cloudflare Cache API wired via caches.default; cache key is /__vault__/<release_id>/<path> so each release is a disjoint namespace. Deploy changes release_id → all old entries miss atomically. Degraded responses are NEVER cached (would poison the namespace). - B.3 Keyset pagination: cursor is {after_id, filter_hash}. Server computes filter_hash per-request and rejects cross-filter cursor reuse with 400. Pagination cost drops from O(offset + N) to O(N) per page. - B.4 Rate limiting via RATE_LIMIT_KV (src/rate_limit.ts): token bucket per (IP, class) windowed at 60s. 'default' 60 rpm, 'search' 10 rpm. Returns 429 with Retry-After header. Open-allows if KV not bound so the local vault-api shim still works. - /search uses FTS5 MATCH when questions_fts exists; fallback to LIKE for pre-FTS5 D1 instances. Escapes FTS5 special chars to prevent MATCH injection. vault-api.ts circuit breaker (B.2 - Soumith R3-F-2 fix): - Proper closed → open → half-open state machine. Half-open admits exactly one probe; failure → re-open immediately, success → close. - AbortSignal.timeout(10_000) per-attempt; AbortSignal.any() combines with caller's signal so React unmounts don't count as failures. - Retry only on retryable statuses (408/425/429/5xx/network), not on 4xx user errors or caller-aborted fetches. - Module-level _singleton so multiple makeClientFromEnv() share breaker state. __resetSingleton() exposed for tests. Worker vitest suite (B.6 - staffml-vault-worker/tests/worker.test.ts): 6 tests: rate-limit under/over cap with Retry-After; schema-fingerprint placeholder forces degraded mode; real fingerprint clears flag; cursor filter_hash mismatch returns 400; CORS echoes allowed origin; 405 on POST/PUT/DELETE; /admin/release returns 404 (no auth footgun).
vault ship real hooks (B.15 - commands/release.py): - d1_forward: pnpm exec wrangler d1 execute <env-db> --file <migration.sql> - d1_rollback: applies d1-rollback.sql (SQL path); snapshot path remains primary per §6.2. - nextjs_forward: pnpm run deploy:<env> from site_dir. - nextjs_rollback: pnpm exec wrangler pages deployment list (lets operator pick rollback target). - paper_forward: git tag -a v<version> && git push origin v<version>. - --skip-legs allows shipping subset (e.g., skip=paper for pre-tag validation). Content-hash SLI workflow (B.5 - .github/workflows/vault-content-hash-sli.yml): Hourly GitHub Action samples 20 IDs from latest release's vault.db, fetches same IDs from production worker, recomputes canonical content_hash in Python, asserts parity. Files a priority-high issue on mismatch. Avoids porting hashing.py canonicalization to TypeScript (Chip R3-H5's invariant-bomb risk). JSON schemas (B.7 - vault-cli/docs/JSON_OUTPUT.md): Full stable shapes for build, publish, ship, new, rm, move, renumber, restore, promote, mark-exemplar, snapshot, migrations-emit, export-paper, tag, deploy, rollback, generate. Plus notes for serve/api (not JSON-emitting - long-running servers). Codegen hash baseline (B.13 hash-check variant): vault codegen --check now computes SHA-256 over 3 shared artifacts and compares to committed interviews/vault-cli/codegen-hashes.txt. First run auto-records baseline; subsequent runs enforce no drift. Full LinkML-driven regeneration remains a Phase-2 follow-up. Baseline recorded this commit. Component migration hook (B.17 - staffml/src/lib/hooks/useVaultQuestion.ts): Minimal React hook that routes through corpus-source.ts. Components opt into the cutover by importing from here; existing corpus.ts callers remain untouched. Cutover-day swap is one import per component, not a big-bang replacement. 28/28 pytest still green. release_hash 1b304282... unchanged (no content-affecting mutations).	
2026-04-16 14:04:03 -04:00
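The closed → open → half-open breaker described in commit f25f9e8184 can be sketched as follows. This is an illustrative Python sketch, not the code in vault-api.ts (which the commit message says is TypeScript). The class name, method names, failure threshold, and cooldown are all hypothetical; the commit message states only the three states, the single half-open probe, and that caller aborts do not count as failures.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    """Hypothetical breaker; threshold and cooldown defaults are assumptions."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self):
        if self.state is State.CLOSED:
            return True
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                return False
            # Cooldown elapsed: go half-open and admit this call as the probe.
            self.state = State.HALF_OPEN
            self.probe_in_flight = True
            return True
        # Half-open: admit exactly one probe at a time.
        if self.probe_in_flight:
            return False
        self.probe_in_flight = True
        return True

    def record_success(self):
        # Any success closes the breaker and clears the failure count.
        self.state = State.CLOSED
        self.failures = 0
        self.probe_in_flight = False

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._open()  # a failed probe re-opens immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def record_abort(self):
        # Caller-cancelled request: not a failure, just release the probe slot.
        self.probe_in_flight = False

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
        self.probe_in_flight = False
```

A caller would check `allow_request()` before each attempt and report the outcome with `record_success()`, `record_failure()`, or `record_abort()`. Sharing one instance at module level mirrors the `_singleton` idea in the commit message.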
Vijay Janapa Reddi	6dff01c065	docs(vault): Phase 0 documentation deliverables EVOLUTION.md (fixes H-1 from REVIEWS.md) Schema-version rules: SemVer semantics (additive-minor implicit, breaking-major bumps schema_version). Loader contract across versions. vault migrate-schema mechanics: parallel tree, forward/ rollback functions, --dry-run, failure log. Mixed-version PRs forbidden — CI rejects. Canonicalization-version (CANON_VERSION) bumps separate from schema_version. Historical record stub. EXIT_CODES.md Stable exit-code taxonomy table with rationale for each category (0 vs 1, 1 vs 2, 3 vs 4, 5 as user-abort). Usage in code, tests, JSON output. Evolution policy: add new codes, never renumber. JSON_OUTPUT.md Common envelope: {ok, exit_code, exit_symbol, command, cli_version, data, errors, warnings}. Per-command schemas for check, stats, verify, doctor, diff. LSP-diagnostic shape for check errors. --json-schema meta-command prints per-command JSON Schema. CONTRIBUTING.md (fixes H-17) Quick-start path from clone → local site serving a question in ≤10min target. What can be contributed, workflow, PR review. Provenance-honesty rules. Author attribution via vault/contributors.yaml. Phase-by-phase scope of what works today vs what lands later. All four are referenced directly from ARCHITECTURE.md sections.	2026-04-15 21:25:52 -04:00
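The common envelope listed in commit 6dff01c065 ({ok, exit_code, exit_symbol, command, cli_version, data, errors, warnings}) might be assembled along these lines. Only the eight field names come from the commit message. The helper name, the default values, the example exit symbol, and the rule that `ok` means exit code 0 are assumptions made for illustration, not the vault CLI's actual behavior.

```python
import json


def make_envelope(command, data=None, errors=None, warnings=None,
                  exit_code=0, exit_symbol="OK", cli_version="0.0.0"):
    """Build a JSON-serializable envelope (hypothetical helper)."""
    return {
        "ok": exit_code == 0,  # assumption: ok mirrors a zero exit code
        "exit_code": exit_code,
        "exit_symbol": exit_symbol,  # placeholder symbol, not a real taxonomy entry
        "command": command,
        "cli_version": cli_version,  # placeholder version string
        "data": data if data is not None else {},
        "errors": list(errors or []),
        "warnings": list(warnings or []),
    }


if __name__ == "__main__":
    print(json.dumps(make_envelope("check", data={"checked": 3}), indent=2))
```

Keeping every key present even when empty lets downstream tooling read the same shape for every command, which is the stability property the commit message emphasizes.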
Vijay Janapa Reddi	eaca50116a	docs(vault): detailed testing plan and cutover QA checklist TESTING.md fleshes out ARCHITECTURE.md §19 with concrete inventory: - Test pyramid: unit / integration / CLI contract / data-migration / equivalence / codegen-drift / worker contract / E2E Playwright / smoke / load / rollback / SLI probes. - Fixtures: 20-question frozen corpus, golden vault.db, 15 schema-drift fixtures covering each invariant class, cross-release fixtures for migrations. - Per-layer test file inventory (every vault subcommand has a named contract test). - CI workflow spec: .github/workflows/vault-ci.yml for every PR plus nightly and deploy workflows. PyYAML + Python pinned for hash stability. - Phase-entry gate table: what testing artifacts block each phase transition. - Observability + rollback protocol for Phase 4. CUTOVER_QA.md: expanded from §19.4 into a sequential operator runbook. Pre-cutover gates, vault ship canary stages, 8 flow checks (home, practice, gauntlet, progress, about, search, chain UX, offline), network/bundle verification, rollback drill (rehearsed on staging first), 48h post-cutover watch cadence, rollback-trigger conditions, post-cutover sign-off checklist. Both docs are living: TESTING.md evolves as CLI surface grows; CUTOVER_QA.md is versioned per release.	2026-04-15 18:07:27 -04:00

29 Commits