4 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
463a180258 fix(vault-cli): _judges adds --skip-trust to gemini invocation
The gemini CLI silently downgrades --yolo to the default approval mode when
its cwd is not in the trusted-folders list (e.g., a scratch dir under
tempfile.gettempdir()). The override is logged to stderr as 'Approval mode
overridden to "default" because the current folder is not trusted',
and the call exits with code 55. --skip-trust opts out of that gate. Verified
2026-05-04 in /tmp/gemini-trust-test.
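
A minimal sketch of the invocation described above, assuming the flags behave as this message states. The helper names (build_gemini_argv, is_trust_override) are hypothetical and are not the actual _judges code; the exit code and stderr text are the ones quoted in this message.

```python
from typing import List

# Flag names (--yolo, --skip-trust, -p) come from the commit messages;
# the helper names and this constant are hypothetical.
TRUST_OVERRIDE_EXIT_CODE = 55


def build_gemini_argv(prompt: str) -> List[str]:
    """Assemble argv for a non-interactive gemini call that skips the trust gate."""
    return ["gemini", "--yolo", "--skip-trust", "-p", prompt]


def is_trust_override(returncode: int, stderr: str) -> bool:
    """Detect the failure mode described above: exit 55 plus the stderr notice."""
    return (
        returncode == TRUST_OVERRIDE_EXIT_CODE
        and "folder is not trusted" in stderr
    )
```

A caller could check is_trust_override() on a failed call to tell this specific failure apart from an ordinary judge error.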
2026-05-04 10:35:13 -04:00
Vijay Janapa Reddi
2d9330da67 fix(vault-cli): isolate gemini CLI scratch files in temp dir
The gemini CLI in --yolo mode occasionally writes scratch files
(prompt_candidates.json, audit.py, evaluate_*.py, partial JSON outputs)
to its CWD. When the CLI was invoked from the repo root, those files landed in
the worktree and polluted git status with ~30 untracked files.

Fix: pass cwd=tempfile.gettempdir()/vault_audit_gemini_scratch to
subprocess.run. The scratch dir is created lazily on import.

This doesn't affect Gemini's output (we capture stdout) or the
prompt (we pass it via -p). It only keeps the gemini CLI's incidental
file-system side effects out of the worktree.
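
A minimal sketch of the cwd isolation, assuming a wrapper around subprocess.run. The name run_in_scratch is hypothetical, and the 600s default mirrors the timeout mentioned in commit 12032f700c; the real wrapper may differ.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List

# Directory name taken from the commit message; created once at import time.
SCRATCH_DIR = Path(tempfile.gettempdir()) / "vault_audit_gemini_scratch"
SCRATCH_DIR.mkdir(parents=True, exist_ok=True)


def run_in_scratch(argv: List[str], timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run argv with its working directory set to the scratch dir.

    stdout and stderr are captured as text, so any files the child writes
    to its CWD land in SCRATCH_DIR rather than in the caller's directory.
    """
    return subprocess.run(
        argv,
        cwd=SCRATCH_DIR,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Any child process that writes relative paths can stand in for the gemini CLI to check the behavior: the files appear under SCRATCH_DIR and git status in the repo stays clean.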

CORPUS_HARDENING_PLAN.md Phase 3 (delayed reliability fix).
2026-05-03 11:08:53 -04:00
Vijay Janapa Reddi
12032f700c fix(vault-cli): audit_corpus_batched.py reliability fixes from canary
Three bugs surfaced by the global-track canary run (2026-05-03,
20260503T123116Z), all fixed:

1. Gemini-CLI subprocess timeout was 240s. The canary's average call took
   ~167s with 72K-char prompts, and some calls exceeded 240s and were
   killed mid-call. 60 questions (2 batches) returned no Gemini
   response. Bumped default timeout in _judges.call_gemini_judge()
   to 600s (≈3.6× the ~167s average, still triggers fast on a stuck call).

2. Resume logic in run_audit() treated ANY persisted row as "audited,"
   including the placeholder rows for batches that errored. That meant
   re-running on the same output dir would skip the failed batches
   forever. Fixed: only rows with format_compliance != "error" are
   added to seen_qids, so a re-run retries the failures.

3. --output passed as a relative path crashed on
   `outdir.relative_to(REPO_ROOT)` because relative paths don't share
   the absolute REPO_ROOT prefix. Fixed: resolve outdir to absolute
   immediately after computing it.
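
Fixes 2 and 3 above can be sketched as follows. This is an illustration under assumptions, not the actual run_audit() code: the row layout (a dict with "qid" and "format_compliance" keys), the non-error values, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, Set, Union


def seen_qids_from_rows(rows: Iterable[Dict[str, str]]) -> Set[str]:
    """Collect question ids that count as audited.

    Placeholder rows written for errored batches carry
    format_compliance == "error" and are skipped, so a re-run retries them.
    """
    return {row["qid"] for row in rows if row.get("format_compliance") != "error"}


def resolve_outdir(output: Union[str, Path]) -> Path:
    """Make the output dir absolute so a later relative_to(REPO_ROOT) can succeed."""
    return Path(output).resolve()
```

Path.relative_to() raises ValueError when one path is relative and the other absolute, which is why resolving up front avoids the crash when the output dir lies under the repo root.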

Validation: re-ran the canary on the same output dir with all three
fixes. Resume correctly skipped the 9 good batches, retried the 2
errored batches, and both completed cleanly in 785s. All 313 global
questions now have real Gemini verdicts (0 errors).

Canary findings:
  format_compliance: 21 fails, 99.6% Gemini-vs-regex agreement
  level_fit:         48 fails (15.3% — the predicted level-inflation
                                pattern; flagged for Phase 5 review)
  coherence:         18 fails
  math_correct:      8 fails
  title_quality:     16 placeholders (matches regex 1:1)

CORPUS_HARDENING_PLAN.md Phase 4 (canary leg).
2026-05-03 09:18:30 -04:00
Vijay Janapa Reddi
dd71c66cae feat(vault-cli): _judges.py + _batching.py — shared infra for batched audit
Two new helper modules under interviews/vault-cli/scripts/. They will be used
by the upcoming audit_corpus_batched.py (CORPUS_HARDENING_PLAN.md Phase 3);
the existing single-call scripts can be migrated to them in a follow-up.

_judges.py exports:
  - GEMINI_MODEL                (pinned)
  - COMMON_MISTAKE_MARKERS      (Pitfall/Rationale/Consequence)
  - NAPKIN_MATH_MARKERS         (Assumptions/Calculations/Conclusion)
  - FAILURE_MODE_TAXONOMY       (4-mode prose block: physical absurdity,
                                 vendor fabrication, mismatch, arithmetic)
  - call_gemini_judge()         (subprocess wrapper + lenient JSON parse)
  - strip_fences()              (response cleanup)
  - gate_format()               (regex format-compliance gate, free)

The taxonomy is the same prose block currently inlined in
validate_drafts.py's COHERENCE_PROMPT and audit_chains_with_gemini.py's
audit prompts. Centralizing it means a future failure-mode addition
flows to every judge, not just one script.
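
A rough sketch of what "response cleanup" plus "lenient JSON parse" could look like. The bodies below are assumptions for illustration only; the real strip_fences() and the parsing inside call_gemini_judge() may differ, and parse_lenient is a hypothetical name.

```python
import json
import re
from typing import Any, Optional

# Built from single backticks so this example can itself sit inside a fence.
FENCE = "`" * 3
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fences(text: str) -> str:
    """Remove one wrapping code fence (with optional language tag), if present."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_lenient(text: str) -> Optional[Any]:
    """Parse JSON from a model reply, tolerating fences and surrounding chatter.

    Returns None when no JSON object can be recovered.
    """
    cleaned = strip_fences(text)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span in the reply.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(cleaned[start : end + 1])
        except json.JSONDecodeError:
            return None
    return None
```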

_batching.py exports:
  - MAX_PROMPT_CHARS = 320_000  (≈80K tokens, attention sweet spot)
  - DEFAULT_WRAPPER_CHARS       (4K headroom for prompt scaffolding)
  - pack_batches[T]()           (generic char-budgeted batcher with
                                 optional hard item cap)

Generalized from audit_chains_with_gemini.py:batch_chains and
build_chains_with_gemini.py:plan_batches. Properties documented in the
docstring (preserves order, no items lost, oversized items still land
in a batch).
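
A minimal sketch of a char-budgeted batcher with the properties listed above (order preserved, no items lost, oversized items still land in a batch). The signature, the size_of callback, and the exact value of DEFAULT_WRAPPER_CHARS are assumptions; the real pack_batches may be shaped differently.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")

MAX_PROMPT_CHARS = 320_000       # value from the commit message
DEFAULT_WRAPPER_CHARS = 4_000    # "4K headroom"; exact value is an assumption


def pack_batches(
    items: Iterable[T],
    size_of: Callable[[T], int],
    max_chars: int = MAX_PROMPT_CHARS,
    wrapper_chars: int = DEFAULT_WRAPPER_CHARS,
    max_items: Optional[int] = None,
) -> List[List[T]]:
    """Greedily pack items into batches under a character budget.

    An item larger than the whole budget gets a batch of its own
    instead of being dropped.
    """
    budget = max_chars - wrapper_chars
    batches: List[List[T]] = []
    current: List[T] = []
    used = 0
    for item in items:
        cost = size_of(item)
        over_budget = used + cost > budget
        over_cap = max_items is not None and len(current) >= max_items
        # Only flush a non-empty batch, so every item is placed somewhere.
        if current and (over_budget or over_cap):
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches
```

For example, with a 10-character effective budget, the strings of length 5, 5, 20, and 3 pack into three batches: the first two together, the oversized one alone, and the last one alone.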

Followups:
- migrate validate_drafts.py and audit_chains_with_gemini.py to use
  _judges.call_gemini_judge instead of their inlined wrappers (out of
  scope here; non-blocking for the audit work).

CORPUS_HARDENING_PLAN.md Phase 3.
2026-05-03 08:22:39 -04:00