Three 'if cond: stmt' single-line forms in the release-stats loop tripped
ruff E701. Re-formatted to ruff-clean multi-line conditionals; behavior
unchanged.
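A minimal before/after of the pattern, for illustration (variable
names invented; the real loop lives in the release-stats code):

    track, published = "cloud", 0

    # before (tripped ruff E701: statement on the same line as the colon):
    #   if track == "cloud": published += 1

    # after (ruff-clean multi-line conditional; behavior unchanged):
    if track == "cloud":
        published += 1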
Remove ten files from the public repo that should never have been
tracked. Verified no code references any of them before deleting.
AI-prompt files (private to author tooling, do not belong in the public
repo):
- interviews/vault-cli/docs/GEMINI_SELF_AUDIT_PROMPT.md
- interviews/vault/_pipeline/runs/gemini-self-audit/prompts/
  {cloud,edge,global,mobile,tinyml}_audit_prompt.md (5 per-track
  prompts; interviews/vault/.gitignore already excludes /_pipeline/,
  but these five were force-added in f6c41d7689 before the rule was set)
Dev-scratch artifacts (clearly leftover dev iteration; three of the
four filenames literally say 'final'):
- interviews/vault-cli/check_results_absolute_final.json
- interviews/vault-cli/check_results_after_repair.json
- interviews/vault-cli/check_results_final.json
- interviews/vault-cli/check_results_total_final.json
No production code, tests, docs, or CI references any of these paths.
The audit-pipeline scripts that *would* write into _pipeline/ already
respect the existing gitignore rule for that directory tree.
The paper's auto-generated macros.tex was last regenerated when the v1.0.0
snapshot held 9,446 published questions; the post-tag audit work has since
brought the published count to 9,521 (cloud +49, edge +14, mobile +2,
tinyml +6, global +4) and consolidated topics from 89 to 87. Re-run
`vault export-paper 1.0.0` so paper and site agree by construction.
While here, fix a bug in the export-paper command itself: \numvalidated
was hardcoded to 100.0\% regardless of the actual flag distribution. The
flag isn't compiled into vault.db, so we read it back from the source
YAMLs and emit the real percentage. Current state is 92.4\% (8,794 of
9,521 published questions carry validated=true). The drift came from
new questions added without the flag set; the conservative fallback if
the YAML scan fails preserves the legacy 100.0\% so the build never
breaks.
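A sketch of the scan (function name and paths are assumptions; the
real implementation lives in the export-paper command):

    from pathlib import Path
    import yaml

    def validated_percentage(questions_dir: Path) -> float:
        # Count validated=true across published question YAMLs; fall
        # back to the legacy value so the paper build never breaks.
        total = validated = 0
        for path in questions_dir.rglob("*.yaml"):
            doc = yaml.safe_load(path.read_text())
            if not isinstance(doc, dict) or doc.get("status") != "published":
                continue
            total += 1
            if doc.get("validated") is True:
                validated += 1
        if total == 0:
            return 100.0  # conservative fallback: scan failed / empty
        return round(100.0 * validated / total, 1)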
The macros change is the meaningful diff. release.json for 1.0.0 is
left untouched to preserve the historical release metadata; vault.db is
gitignored anyway so contributors rebuild it locally via `vault build`
before paper renders.
Brings in the dev-side prose / bib / math fixes that landed since the
yaml-audit branch was cut, and resolves three small conflicts:
* interviews/vault-cli/scripts/archive/split_corpus.py
origin/dev deleted it (archive cleanup); we honor the deletion.
* interviews/vault-cli/scripts/validate_drafts.py
origin/dev removed a leftover no-op statement; took theirs.
* interviews/vault-cli/scripts/summarize_proposed_chains.py
origin/dev renamed loop var lvl→level; took theirs.
The two protected qmds (data_selection.qmd, model_compression.qmd)
are temp-stashed before the merge to honor the 'do not touch' rule;
restored after the merge commit lands.
After this commit, yaml-audit contains every commit on origin/dev as
an ancestor, so dev can fast-forward to yaml-audit's tip when the
maintainer is ready to merge.
Add interviews/staffml/README.md covering the local development
workflow that the prior commit's predev hook relies on:
- TL;DR install + run-dev steps
- explanation of the production-worker vs local-static data flow
- what the predev hook does (sync-periodic-table + vault build --local)
- env vars (NEXT_PUBLIC_VAULT_FALLBACK, NEXT_PUBLIC_VAULT_API,
STAFFML_SKIP_LOCAL_CORPUS) and their effects
- troubleshooting the three failure modes that bit us during the YAML
audit work (could-not-load, stale content, infinite loading)
Update interviews/vault-cli/README.md to surface `vault build --local`
in the Local-dev section with a pointer to the StaffML README.
The intent: a contributor who edits a YAML and doesn't see the change
in the dev server should now find the answer in the README before
they're forced to read the loader source.
Before this change, the StaffML Next.js dev server fetched scenario and
details (including napkin_math) from the production Cloudflare Worker
even when contributors had local YAML edits — so changes weren't visible
without shipping. The opt-in static-fallback path existed but was wired
incorrectly: getStaticFullDetail used a Function-constructor dynamic
import of ../data/corpus.json, which Turbopack rewrote to a non-existent
/_next/static/data/corpus.json URL and 404'd at runtime.
Fix in three parts:
1. Loader (interviews/staffml/src/lib/corpus.ts): replace the broken
dynamic import with fetch('/data/corpus.json'). On failure, throw a
clear error pointing at `vault build --local`.
2. Build (interviews/vault-cli/src/vault_cli/commands/build.py): mirror
the generated corpus.json into interviews/staffml/public/data/ so
Next serves it as a static asset. Add --local as a clearer alias for
--local-json and update the help text to spell out the dev workflow.
3. Wiring (interviews/staffml/package.json + scripts/build-local-corpus.mjs):
predev now runs `vault build --local` automatically, with a soft-fail
path if the vault CLI isn't installed (so first-time contributors
still get a working dev server, just with the worker fallback). The
committed .env.development sets NEXT_PUBLIC_VAULT_FALLBACK=static so
the static path is the default in dev. Both copies of corpus.json are
gitignored as build artifacts (the YAMLs are the source of truth).
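For part 2, the mirror step amounts to a copy into Next's static dir;
a sketch under assumed paths (the real logic is in build.py):

    import shutil
    from pathlib import Path

    def mirror_corpus(repo_root: Path) -> None:
        # Copy the generated corpus.json to where Next serves static
        # assets, so fetch('/data/corpus.json') resolves in dev.
        src = repo_root / "interviews/vault-cli/corpus.json"  # assumed
        dest_dir = repo_root / "interviews/staffml/public/data"
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest_dir / "corpus.json")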
A self-contained prompt that lets gemini CLI walk the corpus and audit it
directly via its own filesystem tools, without the audit_corpus_batched.py
Python wrapper. Useful when the wrapper hits rate-limit / exit-55 walls
or when the operator wants Gemini to checkpoint to disk as it goes.
The prompt uses an append-only JSONL output at
interviews/vault/_pipeline/runs/gemini-self-audit/01_audit.jsonl with
resume semantics (re-running skips qids already in the file). Encodes
the same five gates as audit_corpus_batched.py (format_compliance,
level_fit, coherence, math_correct, title_quality) plus a stable JSON
shape so downstream tooling can consume it identically.
Includes invocation guidance: --yolo + --skip-trust, slice by track to
avoid the multi-hour serial walk, resume across sessions.
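The resume rule reduces to a qid set read from the JSONL before any
auditing starts; a sketch (row shape assumed to carry a 'qid' key):

    import json
    from pathlib import Path

    AUDIT = Path("interviews/vault/_pipeline/runs/"
                 "gemini-self-audit/01_audit.jsonl")

    def already_audited() -> set[str]:
        # Append-only JSONL: every completed row is one line; a re-run
        # skips any qid already present in the file.
        if not AUDIT.exists():
            return set()
        return {json.loads(line)["qid"]
                for line in AUDIT.read_text().splitlines() if line.strip()}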
The gemini CLI silently overrides --yolo to default approval mode when
its cwd is not in the trusted-folders list (e.g., a tempfile.gettempdir
scratch dir). The override is logged to stderr as 'Approval mode
overridden to "default" because the current folder is not trusted'
and the call exits 55. --skip-trust opts out of that gate. Verified
2026-05-04 in /tmp/gemini-trust-test.
Three gap fixes surfaced by a corpus audit on 2026-05-04:
1. 55 cloud YAMLs were missing the status field entirely; Pydantic
silently defaulted them to 'draft', so audit_corpus_batched skipped
them. fix_missing_metadata.py adds explicit
status: draft + provenance: imported.
2. 59 deleted YAMLs lacked the deletion_reason that the soft-delete
pairing rule requires. Added placeholder text noting the original
reason was not preserved on import.
3. The 55 newly-explicit drafts went through a focused vault audit
(gates: format/level_fit/coherence/math/title). 41 passed all five
gates and were promoted to status: published. The remaining 14 had
real issues (13 level_fit / 2 coherence / 1 math) and stay drafts
for authoring follow-up.
audit_corpus_batched.py now accepts non-published YAMLs when --qids
is explicit (the operator opted in). Default behavior (full-corpus
audit) is unchanged: published-only.
On-disk corpus now: 9,487 published (was 9,446, +41) · 423 drafts
· 386 flagged · 390 deleted · 25 archived · 0 missing-status.
vault check --strict and pytest both clean.
Three coordinated edits to lift the marker convention from a soft
draft-validation gate to a published-corpus invariant:
1. interviews/vault/schema/question_schema.yaml (LinkML, source of truth):
common_mistake and napkin_math gain regex patterns matching the
AUTHORING.md Pitfall/Rationale/Consequence and Assumptions/
Calculations/Conclusion conventions. Documents the spec; enforced
in the validator below.
2. interviews/vault-cli/src/vault_cli/models.py (Pydantic, derived):
Details flips from extra='allow' to extra='forbid'. A pre-flight
survey on 2026-05-04 across all 10,711 YAMLs found 0 unknown keys
on Details, so the historical 'imported legacy fields' risk no
longer applies.
3. interviews/vault-cli/src/vault_cli/validator.py:
structural_tier gains _check_format_markers (invariant #19), which
flags published YAMLs whose non-empty cm/nm doesn't match the
AUTHORING.md markers. Drafts are exempt — author-in-progress drafts
may still have malformed markers. Lifts gate_format from
validate_drafts.py / _judges.py from a CI-time gate to a
vault-check-strict invariant.
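A compressed sketch of edit 2 (field list abridged; the full model is
in models.py):

    from pydantic import BaseModel, ConfigDict

    class Details(BaseModel):
        # extra flips from "allow" to "forbid": unknown keys now fail
        # validation instead of being silently carried along.
        model_config = ConfigDict(extra="forbid")
        common_mistake: str | None = None
        napkin_math: str | None = None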
Tests: 4 new cases in test_models covering Details forbid, marker-
compliant pass, malformed cm fail, and draft-exempt skip. Total
88 passing (was 84). codegen-hashes.txt updated for the models.py
edit; vault codegen --check passes.
The on-disk corpus is fully clean post-Phase-5+drain: vault check
--strict reports 10,711 loaded, 0 invariant failures, 0 format-
marker violations on published YAMLs.
regenerate_format_markers.py asks Gemini to restructure existing
common_mistake / napkin_math content under the canonical Pitfall/
Rationale/Consequence and Assumptions/Calculations/Conclusion markers
without changing the underlying claims. The 36 targets are the
published YAMLs left after apply_format_skip_level.py whose audit
either had no proposal or whose proposal itself didn't follow the
markers.
One Gemini pass of four calls (batches of 10 + 10 + 10 + 6) returned
36/36 rewrites, all marker-compliant, all Pydantic-valid. Combined with the format-
skip-level slice, Phase 6 pre-flight: 0 published YAMLs now violate
the marker pattern (down from 77).
apply_format_skip_level.py applies marker-compliant common_mistake /
napkin_math corrections for published qids whose proposed fix got
skipped during Phase 5 because the row was entangled with a level
relabel (relabel-up or chain-monotonicity-block) or a high-risk
realistic_solution rewrite. The script applies ONLY the format fields
when the current YAML's value is malformed AND the proposed value
matches the AUTHORING.md markers. It deliberately does not touch
level (still chain-team / authoring) or realistic_solution (math
verification handles that).
Phase 6 pre-flight: a survey on 2026-05-04 found 77 published YAMLs
with malformed markers. This pass fixes 41 of them. Remaining 36
have no marker-compliant proposal in the audit and need a fresh
authoring round before the LinkML pattern can land cleanly.
Reflects the 2026-05-04 follow-up slices: math-skip-level (15 applies)
and math-finish queue drain (66 applies). Cumulative now 2,372 of
2,757 (86.0%); 385 known-deferred ahead of Phase 6. Also corrects the
original doc's '70 already-applied no-ops' line — those were unverified
math candidates the verify guard skipped, not no-ops.
apply_math_skip_level.py is a Phase 5 cleanup helper. For the small set
of qids whose math fix carries a level relabel that's chain-blocked or
relabel-up, the math correction is independently verified and applies
cleanly — only the level relabel is the chain-team / authoring decision.
This script applies napkin_math/realistic_solution/common_mistake while
leaving level untouched, writing a 05_math_skip_level.json sidecar.
verify_math_corrections.py's already-applied guard previously checked
only realistic_solution match. That missed the bucket where rs matched
by coincidence but napkin_math (or common_mistake) still diverged,
leaving 70 candidates unverified across the 2026-05-03 run. The guard
now considers all three math fields.
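A sketch of the widened guard (row shape is an assumption):

    MATH_FIELDS = ("realistic_solution", "napkin_math", "common_mistake")

    def already_applied(current: dict, proposed: dict) -> bool:
        # Previously only realistic_solution was compared; a row now
        # counts as applied only if all three math fields match.
        return all(current.get(f) == proposed.get(f)
                   for f in MATH_FIELDS if f in proposed)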
Self-contained resume guide for the next session:
- Confirms Phases 0-5 (autonomous) + 8 done
- Documents 478 unresolved corrections (cross-refs PHASE_5_UNRESOLVED)
- Step-by-step for Phase 5 cleanup → Phase 6 schema → Phase 7 verify
→ Phase 9 release
- Concrete CLI commands for each step (vault audit review with
--filter-gate flags, vault codegen, vault publish)
- Reference doc map (which doc covers what)
- Pipeline data layout (where the canonical 01_audit.json lives)
- Full commit log from this session
- Merge command to land yaml-audit on dev when ready
- Paste-ready resume prompt for the next Claude Code session
Total estimated remaining work to ship vault 1.0.0: ~9h, mostly Phase 5
review + Phase 6 schema. Tree is clean; ready to hand off.
After the autonomous Phase 5 mass-apply + math-verify passes,
2,279 of 2,757 corrections (82.6%) were auto-applied. The remaining
478 were deliberately not applied. All but the already-applied bucket
fail one of three safety checks (independent math verification, the
relabel-up policy, chain monotonicity):
75 math 'no' — independent Gemini check disputed the fix
14 math 'unclear' — Gemini wasn't confident
13 math + level-block — fix has level relabel that breaks a chain
168 relabel-up — against CORPUS_HARDENING_PLAN.md §10 Q3
138 chain-block — would break chains.json monotonicity
70 already-applied — no action needed
This doc:
- Summarizes the skip reasons + counts
- Points to the disposition logs in _pipeline/runs/
- Recommends a per-category review workflow
- Notes which categories are highest priority (math 'no')
- Notes which are chain-restructuring decisions (out of Phase 5 scope)
Reviewer flow uses `vault audit review` (apply_corrections.py wrapper)
with --filter-gate to target specific buckets.
Phase 5 autonomous portion is COMPLETE. Phase 6 (schema tightening)
remains safe to attempt once the 478 are dispositioned or
accepted as known-deferred.
Independent Gemini verification pass for the 376 high-risk corrections
that include realistic_solution rewrites (math-driven fixes).
Process:
1. For each row with a realistic_solution rewrite, build a payload
with: scenario, question, original solution, proposed napkin_math,
proposed realistic_solution.
2. Batch ~10 per call; ask Gemini to RE-DERIVE the answer from the
scenario as if it hadn't seen the proposed answer, then compare.
3. Each item gets verdict: yes / no / unclear.
4. Auto-apply ONLY 'yes' verdicts subject to:
- Pydantic validation (must pass before write)
- Chain monotonicity check (level relabels can't break chains)
- Relabel-up policy (relabel-down only)
Verification prompt explicitly instructs Gemini to default to "unclear"
when uncertain — strict bar for auto-apply.
Outputs:
03_math_verification.json per-qid verdict + rationale
04_math_applied.json per-qid apply result
Note: forced past .gitignore's `**/VERIFY_*.py` rule (case-insensitive
match on macOS). The rule was for legacy LLM-generated scratch files;
this is intentional production tooling.
CORPUS_HARDENING_PLAN.md Phase 5 — math-fix verification leg.
Automates the safe subset of Phase 5 review work. Reads a 01_audit.json
from a --propose-fixes run and auto-applies LOW-risk corrections
without prompting. HIGH-risk corrections (anything rewriting
realistic_solution) are skipped — those need separate math verification.
Risk classification:
LOW : correction touches only ⊆ {title, level, common_mistake, napkin_math}
HIGH : any correction including realistic_solution
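The same rule as code (names are illustrative):

    LOW_RISK_FIELDS = {"title", "level", "common_mistake", "napkin_math"}

    def classify_risk(touched_fields: set[str]) -> str:
        # LOW iff the correction touches only low-risk fields;
        # anything rewriting realistic_solution is HIGH.
        return "LOW" if touched_fields <= LOW_RISK_FIELDS else "HIGH"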
Defensive checks for level relabels (caught real bugs during 2026-05-03
smoke):
1. Relabel-UP blocked — policy is relabel-down only (§10 Q3).
Gemini will sometimes propose L3→L4 even with the prompt asking
for down; we filter regardless.
2. Chain-monotonicity check — chains.json requires non-decreasing
levels along chain positions. A relabel that drops a member
below its predecessor breaks the chain. The check overlays
prior applies in the same run so cascading same-chain relabels
don't slip through.
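A sketch of check 2 (the real version overlays relabels already
applied in the same run into chain_levels before calling this):

    def relabel_breaks_chain(chain_levels: list[int],
                             pos: int, new_level: int) -> bool:
        # chains.json requires non-decreasing levels along chain
        # positions; a relabel that drops a member below its
        # predecessor breaks the chain.
        levels = list(chain_levels)
        levels[pos] = new_level
        return any(a > b for a, b in zip(levels, levels[1:]))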
Pydantic validation runs BEFORE writing each YAML; failures don't
write. Atomic temp+rename keeps state consistent under interruption.
Outputs disposition sidecar at <run-dir>/02_mass_apply.json with
per-qid result + reason.
Used to apply 2,075 of 2,381 low-risk corrections from the
2026-05-03 audit dataset (138 chain-monotonicity blocks, 168
relabel-up blocks). 0 Pydantic failures.
CORPUS_HARDENING_PLAN.md Phase 5 — low-risk leg.
Wraps the existing scripts under one user-facing surface:
vault audit run → audit_corpus_batched.py
vault audit review → apply_corrections.py
vault audit summarize → summarize_audit.py
vault audit merge → merge_audit_runs.py
Each subcommand is a thin shell around subprocess.run on the
corresponding script. Args are forwarded; exit codes propagate.
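The whole wrapper pattern, roughly (script path resolution is an
assumption):

    import subprocess
    import sys
    from pathlib import Path

    SCRIPTS = Path(__file__).resolve().parents[2] / "scripts"  # assumed

    def run_script(name: str, argv: list[str]) -> None:
        # Forward args verbatim; propagate the script's exit code.
        proc = subprocess.run([sys.executable, str(SCRIPTS / name), *argv])
        sys.exit(proc.returncode)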
The cron workflow (staffml-audit-corpus-monthly.yml, shipped earlier)
invokes the underlying scripts directly and is unchanged. Humans now
reach for `vault audit run --all --propose-fixes` instead of the
script path.
vault audit --help shows the 4 subcommands cleanly.
pytest 84/84; ruff clean.
CORPUS_HARDENING_PLAN.md Phase 8 (CLI half).
merge_audit_runs.py — merges multiple per-track audit_corpus_batched
output dirs into one canonical run. Per-qid prefer non-error rows,
then rows with suggested_corrections.
AUDIT_FINDINGS_2026-05-03.md — first complete corpus audit.
summarize_audit.py — truncate rationale snippets at word boundaries
(was truncating mid-word, tripping codespell on words like 'claimin').
Phase 4 final stats (9,446 published questions audited):
format_compliance: ~960 fail
level_fit: ~1,580 fail
coherence: ~480 fail
math_correct: ~330 fail
title_quality: ~250 placeholder + ~25 malformed
20 error rows in global to retry on next run
1,767 questions have suggested_corrections; ~1,500 more need a
propose-fixes backfill pass (mostly cloud, some edge).
CORPUS_HARDENING_PLAN.md Phase 4 finalization.
The gemini CLI in --yolo mode occasionally writes scratch files
(prompt_candidates.json, audit.py, evaluate_*.py, partial JSON outputs)
to its CWD. When invoked from the repo root, those landed in the
worktree and polluted git status with ~30 untracked files.
Fix: pass cwd=tempfile.gettempdir()/vault_audit_gemini_scratch to
subprocess.run. The scratch dir is created lazily on import.
This doesn't affect Gemini's output (we capture stdout) or the
prompt (we pass via -p). It just keeps the gemini CLI's incidental
file-system side effects out of the worktree.
CORPUS_HARDENING_PLAN.md Phase 3 (delayed reliability fix).
Phase 5's interactive review tool. Reads a 01_audit.json from a
--propose-fixes run, walks rows with non-empty suggested_corrections,
shows a unified-diff per modified field, and prompts accept/reject/
edit/skip/quit. Validates every accepted body against Pydantic before
writing.
Per CORPUS_HARDENING_PLAN.md correction policy:
- math errors: rewrite napkin_math AND realistic_solution as a unit
- level inflation: relabel DOWN, never rewrite up to match
- format markers: add markers without changing prose semantics
Resumable: dispositions persist to 02_dispositions.json after each
decision; re-running skips already-decided qids. --auto-accept-format
auto-accepts format-marker-only fixes (lower-risk).
Smoke-tested against the in-flight Phase 4 audit: 0 candidates (no
--propose-fixes data yet) and exits clean.
CORPUS_HARDENING_PLAN.md Phase 5.
Adds ThreadPoolExecutor parallelism to the audit run loop. Without it,
a 9,446-question corpus audit would take ~14h sequential at the
canary-measured ~167s/call rate. With 4-way parallelism + 1s submit
stagger, the same audit fits in ~3-4h.
CLI:
--workers N concurrent Gemini calls (default 4, max 8)
--submit-stagger SECS sleep between batch submissions (default 1.0)
The submit stagger spreads the worker start times so all N workers
don't slam Gemini in the same instant — correlated rate-limit hits
were a concern and the stagger costs only N seconds at startup.
Concurrency safety:
- State (rows + seen_qids + persistent file) lives behind _state_lock.
Mutations + atomic temp+rename writes happen inside the lock.
- Gemini subprocess calls run OUTSIDE the lock so workers don't block
each other on the slow path.
- _print_lock keeps stdout/stderr legible across workers (no
interleaved lines).
- normalize_response now drops Gemini-hallucinated qids (returned but
not in the batch) and warns to stderr.
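A sketch of the submit loop and lock discipline (function names
invented; the real code is in the audit script):

    import threading
    import time
    from concurrent.futures import ThreadPoolExecutor

    _state_lock = threading.Lock()
    rows: list[dict] = []

    def run_batches(batches, audit_one, workers=4, stagger=1.0):
        def work(batch):
            new_rows = audit_one(batch)  # slow Gemini call, lock NOT held
            with _state_lock:            # state + persisted-file updates
                rows.extend(new_rows)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = []
            for batch in batches:
                futures.append(pool.submit(work, batch))
                time.sleep(stagger)      # spread worker start times
            for f in futures:
                f.result()               # re-raise worker exceptions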
Validation: smoke-tested on edge track with --max-calls 4 --workers 4.
All 4 batches started in the first 3 seconds (1s stagger ×3); all
finished within 290s (vs ~683s expected sequentially — 2.35× speedup
close to the ideal 4× ceiling). 0 errors, no JSON corruption from
concurrent writes.
The smoke-test results gave us the first edge-track Phase 4 signal:
22.5% level_fit fail rate (vs global's 15.3% — edge has higher
level-inflation than global, worth tracking through Phase 5).
CORPUS_HARDENING_PLAN.md Phase 4.
Three bugs surfaced by the global-track canary run (2026-05-03,
20260503T123116Z), all fixed:
1. Gemini-CLI subprocess timeout was 240s; canary's average call took
~167s with 72K-char prompts occasionally exceeding 240s and getting
killed mid-call. 60 questions (2 batches) returned no Gemini
response. Bumped default timeout in _judges.call_gemini_judge()
to 600s (≈3× typical, still triggers fast on a stuck call).
2. Resume logic in run_audit() treated ANY persisted row as "audited,"
including the placeholder rows for batches that errored. That meant
re-running on the same output dir would skip the failed batches
forever. Fixed: only rows with format_compliance != "error" are
added to seen_qids, so a re-run retries the failures.
3. --output passed as a relative path crashed on
`outdir.relative_to(REPO_ROOT)` because relative paths don't share
the absolute REPO_ROOT prefix. Fixed: resolve outdir to absolute
immediately after computing it.
Validation: re-ran the canary on the same output dir with all three
fixes. Resume correctly skipped the 9 good batches, retried the 2
errored batches, and both completed cleanly in 785s. All 313 global
questions now have real Gemini verdicts (0 errors).
Canary findings:
format_compliance: 21 fails, 99.6% Gemini-vs-regex agreement
level_fit: 48 fails (15.3% — the predicted level-inflation
pattern; flagged for Phase 5 review)
coherence: 18 fails
math_correct: 8 fails
title_quality: 16 placeholders (matches regex 1:1)
CORPUS_HARDENING_PLAN.md Phase 4 (canary leg).
7 tests covering pack_batches:
- empty input → no batches
- single small item → one batch
- no items lost across batches (50 items, 10/batch → all 50 round-trip)
- max_items_per_batch caps batch size (33 items, 10/batch → 10/10/10/3)
- max_chars triggers a flush before items overflow the budget
- input order preserved within and across batches
- oversized single item still lands in a batch (we don't drop, the
caller is expected to detect overflow downstream)
The audit script itself can't easily be unit-tested in CI (it
subprocess-shells the gemini CLI); the batching helper is the main
piece of pure logic, so this is where the value is.
84 / 84 pytest pass (was 77; added 7)
CORPUS_HARDENING_PLAN.md Phase 3.
Replaces the dead-end audit_corpus.py (deleted in Phase 0). The new
design batches 30-40 questions per Gemini call instead of 1 question
per gate, dropping the corpus-audit cost by ~10×.
Per call, ONE prompt asks Gemini for a JSON array of per-question
verdicts across:
- format_compliance: pass/fail (regex-checkable; cross-checked
against host-side gate_format)
- level_fit: pass/fail/skip + rationale (level inflation
+ verb mismatch + "no real judgement required")
- coherence: pass/fail + failure_mode (physical_absurdity /
vendor_fabrication / mismatch / arithmetic)
- math_correct: pass/fail/no_math + specific errors
- title_quality: good/placeholder/malformed
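One hypothetical output row (key names follow the gate list above;
values invented for illustration, exact shape is whatever the script
emits):

    row = {
        "qid": "edge-2536",
        "format_compliance": "pass",
        "level_fit": "fail",        # pass/fail/skip + rationale
        "level_fit_rationale": "level inflation: recall plus arithmetic",
        "coherence": "pass",        # failure_mode present only on fail
        "math_correct": "no_math",
        "title_quality": "good",
    }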
Cost (full corpus, 9,446 published):
- audit-only: ~315 calls (1.3 days at the 250/day cap)
- --propose-fixes: ~+50% (denser per-batch output → smaller batches)
Modes:
--all full corpus (default)
--tracks cloud,edge track filter
--qids X,Y,Z explicit qid set
--propose-fixes ALSO ask Gemini to propose corrections
(per CORPUS_HARDENING_PLAN.md §10:
- math errors: rewrite napkin_math AND
realistic_solution as a UNIT
- level inflation: relabel DOWN, never
attempt to rewrite the question up)
--max-calls N cap per invocation; resume by re-running
--batch-size N tuning override
--dry-run plan without calling Gemini
Output convention: _pipeline/runs/<UTC-timestamp>/
00_config.json — flags, model, candidate count
01_audit.json — per-question rows (resumable; rewritten after
each batch so a Ctrl-C / timeout doesn't lose work)
Sanity check: dry-run on full corpus packs 9,446 questions into 315
batches of 30, with payloads 55-69KB each (well under the 320KB
attention sweet spot for gemini-3.1-pro-preview).
CORPUS_HARDENING_PLAN.md Phase 3.
Two new helper modules under interviews/vault-cli/scripts/. Used by the
upcoming audit_corpus_batched.py (CORPUS_HARDENING_PLAN.md Phase 3) and
extractable from the existing single-call scripts in a follow-up.
_judges.py exports:
- GEMINI_MODEL (pinned)
- COMMON_MISTAKE_MARKERS (Pitfall/Rationale/Consequence)
- NAPKIN_MATH_MARKERS (Assumptions/Calculations/Conclusion)
- FAILURE_MODE_TAXONOMY (4-mode prose block: physical absurdity,
vendor fabrication, mismatch, arithmetic)
- call_gemini_judge() (subprocess wrapper + lenient JSON parse)
- strip_fences() (response cleanup)
- gate_format() (regex format-compliance gate, free)
The taxonomy is the same prose block currently inlined in
validate_drafts.py's COHERENCE_PROMPT and audit_chains_with_gemini.py's
audit prompts. Centralizing it means a future failure-mode addition
flows to every judge, not just one script.
_batching.py exports:
- MAX_PROMPT_CHARS = 320_000 (≈80K tokens, attention sweet spot)
- DEFAULT_WRAPPER_CHARS (4K headroom for prompt scaffolding)
- pack_batches[T]() (generic char-budgeted batcher with
optional hard item cap)
Generalized from audit_chains_with_gemini.py:batch_chains and
build_chains_with_gemini.py:plan_batches. Properties documented in the
docstring (preserves order, no items lost, oversized items still land
in a batch).
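The core algorithm, roughly (signature approximated; the real helper
is generic over T):

    MAX_PROMPT_CHARS = 320_000

    def pack_batches(items, costs, max_chars=MAX_PROMPT_CHARS,
                     max_items=None):
        # Greedy char-budgeted packing: preserves order, loses no
        # items, and an oversized single item still lands in its own
        # batch rather than being dropped.
        batches, batch, used = [], [], 0
        for item, cost in zip(items, costs):
            full = (batch and used + cost > max_chars) or \
                   (max_items and len(batch) >= max_items)
            if full:
                batches.append(batch)
                batch, used = [], 0
            batch.append(item)
            used += cost
        if batch:
            batches.append(batch)
        return batches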
Followups:
- migrate validate_drafts.py and audit_chains_with_gemini.py to use
_judges.call_gemini_judge instead of their inlined wrappers (out of
scope here; non-blocking for the audit work).
CORPUS_HARDENING_PLAN.md Phase 3.
The previous scaffold only stubbed scenario and realistic_solution with
<TODO> placeholders. That meant authors had to know about the markup
conventions from somewhere else (the regex in validate_drafts.py, the
SCHEMA_SUMMARY in generate_question_for_gap.py, or the paragraph in
ARCHITECTURE.md §3.6.1) — none of which a new contributor would find.
Now `vault new` produces a YAML with the canonical bold markers
pre-written. Authors fill in the content between markers; they can't
forget to use them.
Templates extracted as module-level constants (COMMON_MISTAKE_TEMPLATE
and NAPKIN_MATH_TEMPLATE in commands/authoring.py) so they're testable
in isolation. New tests in test_authoring_scaffold.py guard against
accidental marker removal — if a contributor edits the scaffold and
drops, say, **The Rationale:**, the test fails immediately rather than
every new question silently failing the format gate downstream.
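Roughly what the constants look like (abridged; exact wording is in
commands/authoring.py, and marker spellings beyond **The Rationale:**
are assumptions):

    COMMON_MISTAKE_TEMPLATE = """\
    **The Pitfall:** <TODO: what authors get wrong>
    **The Rationale:** <TODO: why the mistake is tempting>
    **The Consequence:** <TODO: what breaks as a result>
    """

    NAPKIN_MATH_TEMPLATE = """\
    **Assumptions:** <TODO: inputs and rough constants>
    **Calculations:** <TODO: the arithmetic, step by step>
    **Conclusion:** <TODO: the number and what it implies>
    """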
77 / 77 pytest pass (was 74; added 3)
ruff clean
vault check --strict — 10,711 loaded, 0 invariant failures
CORPUS_HARDENING_PLAN.md Phase 2.
Walks vault/questions/**/*.yaml, finds published YAMLs with no top-level
provenance line, and inserts `provenance: imported` on the line
immediately after `status: published`. Idempotent — re-running is a
no-op once the field is present. Limits scope to status: published; the
mechanical pass should not overwrite the semantics of draft / flagged /
deleted / archived questions.
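The insertion rule as code (a sketch; the real script may differ in
detail):

    from pathlib import Path

    def add_provenance(path: Path) -> bool:
        # Insert 'provenance: imported' right after 'status: published'.
        # Idempotent: files already carrying a provenance line are
        # skipped on re-run.
        lines = path.read_text().splitlines(keepends=True)
        if any(ln.startswith("provenance:") for ln in lines):
            return False
        for i, ln in enumerate(lines):
            if ln.rstrip() == "status: published":
                lines.insert(i + 1, "provenance: imported\n")
                path.write_text("".join(lines))
                return True
        return False  # not published: out of scope for this pass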
CLI:
--dry-run report what would change
--limit N cap modifications (smoke test)
CORPUS_HARDENING_PLAN.md Phase 1.
End-to-end plan for taking the published-corpus audit from "stratified
sample at ~2,900 calls / 12 days" to "full corpus at ~450 calls / ~3
days". The previous plan over-budgeted by 6× because it assumed
1-call-per-gate-per-question; switching to batched 30-questions-per-call
collapses the cost.
Nine phases, 27 testable acceptance criteria. End state: every published
YAML conforms to a strict schema with load-time-enforced format markers
(Pitfall/Rationale/Consequence + Assumptions/Calculations/Conclusion);
math, level-fit, coherence, vendor fabrication, and physical realism are
independently Gemini-verified at corpus scale; new violations are caught
at vault check --strict time and cannot silently land.
Major design choices:
- Audit + corrections in one tool (audit_corpus_batched.py), with a
--propose-fixes mode whose suggestions are NEVER auto-applied —
humans review via apply_corrections.py.
- Schema tightening AFTER cleanup, not before (Phase 6 lifts pattern
constraints into LinkML / Pydantic only once Phase 5 has cleaned the
corpus, so the new constraints reject nothing real).
- Cron the audit (Phase 8) so findings become a routine artifact.
- AUTHORING.md + vault new scaffold (Phase 2) so new contributors see
the format conventions before authoring, not after CI catches them.
Captures the release-readiness state of the vault and the plan for
finishing the audit work the 250/day Gemini cap has constrained.
Corpus health survey (9,446 published questions, no Gemini cost):
- 100% schema-valid (Pydantic)
- 90.9% format-compliant (Pitfall/Rationale/Consequence + Assumptions/
Calculations/Conclusion markers)
- 9.1% fail format compliance (861 questions; mechanical fixes)
- 134 placeholder titles (all global/* "Global New NNNN")
- 407 with provenance: None (should be "imported")
- 95.3% canonical bold-marker napkin_math; 4.7% partial / bullet-only
Template gap noted: vault new scaffolds only scenario + solution stubs;
the Pitfall/Rationale/Consequence and Assumptions/Calculations/Conclusion
templates are encoded ONLY in the generation prompt and the
format-compliance regex. There's no human-readable AUTHORING.md.
The new session is asked to ship one.
The plan: stratified sample of 1,000 questions (33 per track × level
cell) with full Gemini gate suite (math + coherence + level_fit +
bridge) at ~2,900 calls across ~12 days at the 250/day cap. Full-corpus
audit (~27,400 calls / ~110 days) is infeasible; sampling captures any
failure mode at >5-10% rate.
Includes:
- Concrete numbers from the corpus survey (failure counts by category)
- Day-by-day execution plan with resume instructions
- Daily cost-ledger format
- Stopping rules
- Post-audit cleanup → paper.tech update path
- Mechanical (no-Gemini) cleanups the new session can do in parallel
with the daily audit cycle (provenance fix, format markers, AUTHORING.md)
CHAIN_ROADMAP.md Progress Log entry points the resumable cursor at
this plan.
Sync the yaml-audit branch with the latest dev work since the previous
sync (5c5af75ed). Brings in 73 commits including:
- CI security fixes: postcss XSS bump, uuid bounds bump, codeql
paths-ignore for vendored bundles, read-only token on
staffml-validate-vault workflow
- kits/ dark mode polish: code-block readability, dropdown contrast
- vault-cli/: pre-commit ruff hook + 20 ruff fixes, all-contributors
auto-credit workflow change to pull_request_target
- dev's earlier merge of yaml-audit (836d481b5) carrying the
pre-trailer-strip Phase 1/2/3 history; this merge harmonises that
with the current trailer-clean yaml-audit tip
- misc bug fixes (tinytorch perceptron seed, infra workflows,
socratiq vite dev injector)
Conflict resolutions preserve the yaml-audit-side authoritative
state for vault/* files (we own those) and the dev-side authoritative
state for .github/workflows/* and other shared infrastructure.
# Conflicts:
# .github/workflows/all-contributors-auto-credit.yml
# .github/workflows/staffml-preview-dev.yml
# interviews/staffml/src/data/corpus-summary.json
# interviews/staffml/src/data/vault-manifest.json
# interviews/staffml/tests/chain-and-vault-smoke.mjs
# interviews/vault-cli/README.md
# interviews/vault-cli/docs/CHAIN_ROADMAP.md
# interviews/vault-cli/scripts/build_chains_with_gemini.py
# interviews/vault-cli/scripts/generate_question_for_gap.py
# interviews/vault-cli/scripts/merge_chain_passes.py
# interviews/vault-cli/scripts/validate_drafts.py
# interviews/vault-cli/src/vault_cli/legacy_export.py
# interviews/vault-cli/tests/test_chain_validation.py
# interviews/vault/.gitignore
# interviews/vault/ARCHITECTURE.md
# interviews/vault/chains.json
# interviews/vault/id-registry.yaml
# interviews/vault/questions/edge/optimization/edge-2536.yaml
# interviews/vault/questions/mobile/deployment/mobile-2147.yaml
# tinytorch/src/03_layers/03_layers.py
The "wrote {path}" line at end-of-run called Path.relative_to(REPO_ROOT)
unconditionally, which raised when --output was set to a /tmp/ path
(e.g., during smoke-testing). Same fix as validate_drafts.py earlier:
fall back to displaying the absolute path when relative_to fails.
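The fallback pattern, for reference:

    from pathlib import Path

    def display_path(path: Path, repo_root: Path) -> str:
        # Prefer the repo-relative form; fall back to the absolute
        # path when the output lives outside the repo (e.g., a /tmp/
        # smoke-test dir).
        try:
            return str(path.relative_to(repo_root))
        except ValueError:
            return str(path)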
Surfaced while smoke-testing audit_math.py with --output /tmp/...
before pointing it at the real _pipeline/ destination.
- Remove retired _archive/ and scripts/archive/ trees (site, book filters, games, vault); vault CHANGELOG points to git history for old scripts.
- CONTRIBUTING: site project row, site/ in area map, root vs TinyTorch pre-commit, vault schema drift wording.
- Newsletter CLI: path-agnostic news alias; tinytorch pre-commit comments; add tools/ and staffml-vault-types READMEs for maintainers.
Two additions to the Phase 3 verification stack:
1. validate_drafts.py: new gate_format_compliance (Gate 1.5).
Cheap regex check — no Gemini call. Verifies that the prose-block
conventions our schema doesn't enforce are present:
- common_mistake (when present): Pitfall / Rationale / Consequence
- napkin_math (when present): Assumptions / Calculations / Conclusion
Either field is optional in the schema; the gate only flags
present-but-malformed cases. Smoke-tested against 5 cases (clean,
missing-pitfall, missing-calculations, no-fields, optional-absent).
2. New scripts/audit_math.py: standalone, focused math verifier.
For each question, runs ONE Gemini call to re-derive every
napkin_math calculation from scratch and compare against what's
written. Returns a verdict on:
- arithmetic_correct
- unit_conversions_correct
- conclusion_follows
- errors[] (specific issues with quoted lines)
Use cases: pre-promotion gate on Phase 3 drafts, retroactive
audit of any subset of the published corpus.
Internal parallelism via ThreadPoolExecutor (default 4 workers,
capped at 8 to stay under typical Gemini RPM limits). Modes:
--drafts-only, --files <paths...>, --sample-track + --sample-size.
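A sketch of Gate 1.5 (marker spellings are assumptions; the real
regexes live in validate_drafts.py):

    import re

    CM_MARKERS = (r"\*\*The Pitfall:\*\*", r"\*\*The Rationale:\*\*",
                  r"\*\*The Consequence:\*\*")
    NM_MARKERS = (r"\*\*Assumptions:\*\*", r"\*\*Calculations:\*\*",
                  r"\*\*Conclusion:\*\*")

    def gate_format_compliance(common_mistake, napkin_math) -> list[str]:
        # Pure regex, no Gemini call. Flags only present-but-malformed
        # fields; absent optional fields pass.
        failures = []
        for name, value, markers in (
            ("common_mistake", common_mistake, CM_MARKERS),
            ("napkin_math", napkin_math, NM_MARKERS),
        ):
            if value and not all(re.search(m, value) for m in markers):
                failures.append(name)
        return failures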
The 2026-05-02 audit caught failure modes the existing
validate_drafts.py judges let through: 2 of the 4 drafts that passed
all 4 gates (mobile-2146 physical absurdity, edge-2537 cognitive-load
inflation) were rejected by the independent audit. This commit
tightens the coherence and level_fit prompts to catch those modes
explicitly.
gate_coherence — explicit failure-mode taxonomy:
1. PHYSICAL ABSURDITY: numbers violating real-world hardware bounds
(NPU wake-up >50ms, off-class power figures, latency >5× off for
named hardware, duty-cycling that defeats the use-case).
2. VENDOR FABRICATION: invented hardware / framework / benchmark
names. Conservative — only flag clearly invented, not plausible-
but-unverified.
3. SCENARIO/Q/SOLUTION MISMATCH: question doesn't follow scenario;
solution doesn't answer the question; cross-field number
contradictions.
4. ARITHMETIC ERRORS in napkin_math.
Output now includes a "failure_mode" field for the rationale to
hang on; the verdict is unchanged in shape ("yes"|"no").
gate_level_fit — explicit "level inflation" check:
- L3+ stamped on a question that's actually L1/L2 (recall + simple
multiplication with all inputs given) → reject.
- Verb mismatch (the question's verb is more than 1 Bloom step from
the level field's expected verb) → reject.
- L4+ requires real decomposition / root-cause / trade-off; mechanical
computation with all inputs provided is not L4.
Re-validation against the original Phase 3 pilot drafts (5 calls × 3
gates = 15 Gemini calls):
  mobile-2147  accept → pass on all 4 ✓ (matches audit "accept")
  edge-2536    accept → pass on all 4 ✓ (matches audit "edit-then-
               publish"; 80ms→15ms latency edit shipped earlier)
  edge-2537    reject → fail level_fit ✓ ("level inflation: simple
               arithmetic with all inputs upfront")
  mobile-2146  reject → fail level_fit + coherence ✓ ("0.5s NPU wake-up
               physically absurd; dashcam idle 75% would miss accidents")
  edge-2535    reject → fail originality + coherence ✓ (cos=0.933 vs
               edge-1883; coherence now ALSO catches: "solution doesn't
               actually perform the calculation")
100% agreement with the independent audit. No false-positives on the
legitimate drafts.
Cost: 15 Gemini calls for the re-validation. Going forward, each
draft eats 3 judge calls (level_fit + coherence + bridge) — same as
before; the prompts are bigger but the call count is unchanged.
The 2026-05-02 audit found ~70% of detected chain gaps are
hallucinated — the two anchor questions don't share a scenario
thread, so a "bridge" between them is fictional. Without this gate,
generating from the existing 407-gap backlog would waste ~75% of the
budget (1 generation call + 3 downstream-judge calls per bad gap).
Adds a 1-call pre-filter via call_gemini_prefilter. The judge sees the
gap entry plus the two anchors in full and returns:
  {
    "verdict": "real" | "hallucinated",
    "anchors_share_scenario": "yes" | "no",
    "level_makes_sense": "yes" | "no",
    "rationale": "<one sentence>"
  }
Hallucinated → process_gap returns ok=False with the prefilter
verdict captured for review. Real → falls through to generation
(unchanged downstream behaviour).
Cost analysis at 70% hallucination rate, 30-gap batch:
  Before: 30 generations + 90 judge calls = 120 calls; ~24 drafts pass
          the gates but only ~7 bridge real gaps
  After:  30 prefilter + ~9 generations + 27 judge calls = 66 calls;
          ~7 promotable drafts (same real yield, roughly half the cost)
Skip the pre-filter with --skip-prefilter when re-validating an
already-filtered gap list or for cost-debugging. Default is filter ON.
Smoke checks (mock prefilter responses):
- "real" → process_gap returns ok=True, falls through to generation
- "hallucinated" → ok=False, why="pre-filter: hallucinated gap (...)"
- --skip-prefilter → no pre-filter call, dry_run shows the prompt
Auto-fix removed extraneous f-string prefixes, unused imports
(re, sys, textwrap, defaultdict), an unused local (qids), and
converted datetime.now(timezone.utc) to datetime.now(UTC) (UP017).
Manual fixes split colon/semicolon one-liners onto separate lines
(E701/E702), renamed unused loop vars (cid, chain_id) with leading
underscores (B007), replaced bare except with except Exception (E722),
and renamed loop var L to level to satisfy N806.
Establishes one ignored subdirectory for ALL intermediate outputs of
LLM-driven tooling (chain proposals, gap detection, draft scorecards,
audit traces). Single gitignore rule: /_pipeline/.
Convention is documented in interviews/vault/README.md under "Pipeline
artifacts" — it's a real project layout convention, not AI-specific
config.
Path migration:
interviews/vault/chains.proposed*.json
→ _pipeline/chains.proposed*.json
interviews/vault/gaps.proposed*.json
→ _pipeline/gaps.proposed*.json
interviews/vault/draft-validation-scorecard.json
→ _pipeline/draft-validation-scorecard.json
interviews/vault/audit-runs/
→ _pipeline/runs/
8 scripts updated to define a PIPELINE_DIR constant and route default
outputs through it: build_chains_with_gemini.py,
apply_proposed_chains.py, merge_chain_passes.py, validate_drafts.py,
audit_chains_with_gemini.py, generate_question_for_gap.py,
summarize_proposed_chains.py, promote_drafts.py.
Forward-looking docs (README.md chain-pipeline section + CHAIN_ROADMAP.md
resume instructions + state snapshot) updated to reference the new
paths. Historical Progress Log entries left as-is — they accurately
describe what was committed at the time.
Drive-by .gitignore fixes (both used full repo-relative paths under
package-local .gitignore files, which never matched):
interviews/vault-cli/.gitignore: scripts/.calibration_cache/
interviews/vault/.gitignore: /embeddings.npz
Validation:
- vault check --strict: 10,705 loaded, 0 invariant failures
- pytest interviews/vault-cli/tests/: 74/74
- audit --dry-run: paths resolve correctly to _pipeline/runs/<ts>/
No durable corpus content moves. chains.json (live registry),
id-registry.yaml, questions/, etc. all stay where they were.
Action on the strongest finding from the 2026-05-01 independent audit:
54 of 55 Δ=0 chains had no shared scenario (the "two questions
sharing a scenario thread" constraint the lenient prompt was supposed
to enforce). Two independent audit fields agreed (verdict=bad and
shared_scenario=no), so this isn't a tuning question — the design
choice was wrong.
Why remove Δ=0 entirely rather than tighten the prompt:
- The chain definition is "pedagogical progression through Bloom
levels"; same-level edges contradict the definition.
- The "shared scenario / different angle" carve-out is unenforceable
by an LLM at corpus scale (audit confirmed).
- Same-scenario same-level pairs are more honestly modeled as
siblings of a chain anchor, not as chain members.
Changes:
- chains.json: 879 → 824. Dropped: 55 chains (all tier=secondary,
since Δ=0 was only ever produced by the lenient sweep).
Per-track: edge -19, tinyml -12, mobile -10, cloud -7, global -7.
- build_chains_with_gemini.py:
MODE_CONFIG["lenient"]["allowed_deltas"]: {0,1,2,3} → {1,2,3}
LENIENT_PROMPT_TEMPLATE: Δ=0 paragraph rewritten to explicitly
REJECT same-level pairs (with rationale citing the audit).
docstring + --mode help text updated.
- tests/test_chain_validation.py:
test_lenient_accepts_same_level_pair → test_lenient_rejects_same_level_pair
header docstring updated to reflect the new rule.
- vault-manifest.json: chainCount 879 → 824, releaseHash rolls to
479811040b7a… (real content delta, not a timestamp churn).
Validation:
- vault check --strict: 10,705 loaded, 0 failures
- vault build --local-json: chainCount=824, releaseHash=479811040b…
- pytest: 74/74
- playwright chain-and-vault-smoke: 19/19 (fixtures cloud-0001 +
cloud-0231 are still in their chains post-drop)
Audit findings #2 (gap detection ~50% noise) and #3 (4 pilot drafts
disposition) remain open — see CHAIN_ROADMAP.md Progress Log.
Ran audit_chains_with_gemini.py end-to-end. 18 Gemini-3.1-pro-preview
calls (well under the 250/day cap) sized to 80-336K char prompts (the
attention sweet spot at ~80-100K input tokens). Per-call traces under
interviews/vault/audit-runs/20260501T213817Z/, rollup at
interviews/vault/audit-runs/AUDIT_REPORT.md.
Three critical findings the pipeline's own gates missed:
1. Δ=0 chains are ~98% bad (54/55 judged "bad", 54/55 judged
"shared_scenario_for_d0_pair: no"). The lenient prompt's
constraint that Δ=0 only fire for shared-scenario pairs didn't
bind in practice. 6% of chains.json is affected.
2. Gap detection is ~50% noise. 21 of 40 sampled gaps judged
"hallucinated" — anchors don't share a scenario thread. Phase 3
generation should pre-filter gaps before issuing the call.
3. Pilot draft pass rate was inflated by validate_drafts.py's LLM
judges:
mobile-2147 accept
edge-2536 edit (scenario truncation)
edge-2537 REJECT (cognitive load too low for L3)
mobile-2146 REJECT (physically absurd 0.5s/4W NPU wake-up)
Calibration findings:
- Primary chains (n=100): 64% good, 22% weak, 14% bad
- Secondary chains (n=100): 61% good, 33% weak, 6% bad
- Tier delta vs primary is small at "good" — the actual quality
cliff in secondary is concentrated in the Δ=0 subset.
No autonomous fixes filed — per agreement, audit produces findings
only. CHAIN_ROADMAP.md Progress Log spells out the three concrete
decisions for next session (drop / demote / rebuild Δ=0; pre-filter
gaps; disposition the 4 drafts per AUDIT_REPORT.md).
Total Gemini calls this session: 55 (Phase 1.4 + Phase 3 pilot + audit).
Single-driver script that runs an independent Gemini audit over the
Phase 1-3 chain pipeline output. Designed as a complementary check to
the pipeline's own validation gates (Pydantic schema, embedding cosine,
multiple LLM judges) — runs an INDEPENDENT model pass over what would
otherwise be human-spot-check territory.
Categories (5 audit categories + 1 synthesis call; ~18 calls total,
well under the 250/day Pro cap):
1. drafts 4 Phase 3 promoted drafts: independent quality gate
(fabrication, level fit, answer correctness, scenario
realism — failure modes the existing judges miss)
2. secondary 100-chain sample of tier=secondary chains
3. delta_zero All 55 Δ=0 chains (highest-risk lenient additions)
— verifies the "shared scenario" claim per-pair
4. primary 100-chain sample of tier=primary chains (regression
check on strict-pass quality)
5. gaps 50-gap sample with the two between-questions in full
(real bridge vs hallucination)
6. synthesis 1 wrap-up call → AUDIT_REPORT.md
A previously-planned tier_compare category was dropped: 0 buckets
carry both primary and secondary chains (the lenient sweep was scoped
to uncovered buckets, by definition disjoint). Per-tier quality is
inferred from categories 2 and 4 by the synthesis call.
Per-call target: ~80K input tokens (320K char prompts) — the attention
sweet spot. Chain payloads at ~2-3K chars each pack ~50 chains into
one such prompt.
Outputs land in interviews/vault/audit-runs/<UTC-timestamp>/
config.json — what was sampled, with seed for reproducibility
0N_<category>.json — per-call prompt-char count, IDs, raw response
And one human-readable rollup at interviews/vault/AUDIT_REPORT.md.
Modes: --dry-run (plan only), --only <cat>, --skip <cat,...>,
--seed (for reproducible re-runs).
Findings only — never edits chains.json or any question YAML. Issues
surfaced for human review.
Doc cleanups folded into one commit:
- CHAIN_ROADMAP.md status header reflects current state (Phase 1+2
complete, Phase 3 pilot landed, Phase 4 mostly shipped).
- Phase 4.1 / 4.6 / 4.7 / 4.9 entries marked complete with commit
refs.
- ARCHITECTURE.md gains a §3.6.1 documenting the two YAML-body
conventions introduced when LLM-authored questions started
landing in Phase 3:
- _authoring private metadata block on drafts (stripped at
promotion)
- gap-bridge:<from>-<to> tag added at promotion for traceability
Neither is schema-enforced (Pydantic accepts extra); both are
stable across the pipeline.
No code changes.
Detects chain members that have drifted semantically away from their
chain mates after an edit. Re-embeds changed YAMLs with the same model
the corpus uses (BAAI/bge-small-en-v1.5) and reports the min cosine to
each chain mate.
Default invocation (advisory):
python3 scripts/check_chain_decay.py
# diffs against origin/dev, flags chains with min mate-cosine < 0.40
Other modes:
--files <a.yaml> <b.yaml> explicit files instead of git diff
--base HEAD~5 different base ref
--threshold 0.50 tighter cutoff (slow drift detection)
--strict exit non-zero on flag (use as CI gate)
Default is advisory not blocking — first ship intentionally doesn't
fail commits or CI. The threshold 0.40 is calibrated against the
post-Phase-1 corpus; tune as needed once you've seen what real-edit
deltas look like in practice.
Implementation notes:
- Reuses embeddings.npz for chain-mate vectors (no re-embedding the
whole corpus per run).
- Only the changed question gets re-embedded — fast for typical
PR-sized changes.
- Skips changed questions that aren't in chains; skips chain
memberships where the mate isn't in embeddings.npz (e.g., the
Phase 3 promoted drafts before they hit the next embedding rebuild).
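The core comparison is a handful of numpy lines (a sketch; the
question vector comes from the re-embed, mate vectors from
embeddings.npz):

    import numpy as np

    def min_mate_cosine(q_vec: np.ndarray, mate_vecs: np.ndarray) -> float:
        # Min cosine between the re-embedded question and its chain
        # mates; values below the threshold (default 0.40) flag the chain.
        q = q_vec / np.linalg.norm(q_vec)
        mates = mate_vecs / np.linalg.norm(mate_vecs, axis=1, keepdims=True)
        return float((mates @ q).min())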
Smoke checks:
- --base origin/dev finds 4 changed YAMLs (the Phase 3 promoted
drafts), correctly reports no chain memberships (those questions
aren't in chains.json yet — by design, gated on human review).
- --files <cloud-2520.yaml> on a real chain member: cos=0.79 vs
its L5 mate cloud-2521 (well above 0.40 threshold ✓).
Walkthrough for reviewing LLM-authored question drafts produced by
generate_question_for_gap.py + validate_drafts.py. Covers:
- what each of the 5 gates catches and (critically) misses
- what to read in what order, with watchpoints for the failure modes
that LLM gates routinely let through (vendor-name fabrication,
arithmetic drift, level-stamping mismatches)
- decision tree: promote (publish vs draft), edit + retry, reject
- exact promote_drafts.py invocations for each path
- rough scorecard summary for the 4 pilot drafts shipped in
a750ab7bc, ready for the user's review pass
Designed for ~10-15 min of reading per pilot batch.
Closes the loop on the pilot pattern from a750ab7bc (manual promotion
inline script). Reads draft-validation-scorecard.json and either
promotes every passing draft (--all-passing) or an explicit list
(--qids edge-2536,edge-2537).
Per draft:
- strips _authoring private metadata; replaces with proper schema
fields (provenance, status, authors, human_reviewed, created_at)
- adds gap-bridge:<lower>-<higher> tag for traceability
- renames .yaml.draft → .yaml
- appends id to id-registry.yaml (append-only — preserves the
CI-enforced ledger contract)
Optional flags:
--publish flip status to published (default: keep as draft so
the human reviewer's workflow stays explicit)
--reviewed-by X set human_reviewed.status=verified, by=X, date=now
(implies the reviewer has actually read the drafts)
--dry-run preview without writing
Refuses to overwrite a <id>.yaml that already exists. Skips
already-promoted drafts (with a warning) when called with
--all-passing on a scorecard whose drafts have been promoted earlier.
Smoke checks:
- --all-passing on the existing scorecard correctly identifies all 4
pilot drafts as already-promoted (they shipped in a750ab7bc).
- --qids edge-2535 --dry-run on the leftover failed-validation draft
previews the promotion as expected.