cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-08 02:28:25 -05:00

Author	SHA1	Message	Date
Vijay Janapa Reddi	463a180258	fix(vault-cli): _judges adds --skip-trust to gemini invocation The gemini CLI silently overrides --yolo to default approval mode when its cwd is not in the trusted-folders list (e.g., a tempfile.gettempdir scratch dir). The override is logged to stderr as 'Approval mode overridden to "default" because the current folder is not trusted' and the call exits 55. --skip-trust opts out of that gate. Verified 2026-05-04 in /tmp/gemini-trust-test.	2026-05-04 10:35:13 -04:00
Vijay Janapa Reddi	2d9330da67	fix(vault-cli): isolate gemini CLI scratch files in temp dir The gemini CLI in --yolo mode occasionally writes scratch files (prompt_candidates.json, audit.py, evaluate_*.py, partial JSON outputs) to its CWD. When invoked from the repo root those landed alongside the worktree and polluted git status with ~30 untracked files. Fix: pass cwd=tempfile.gettempdir()/vault_audit_gemini_scratch to subprocess.run. The scratch dir is created lazily on import. This doesn't affect Gemini's output (we capture stdout) or the prompt (we pass via -p). It just keeps the gemini CLI's incidental file-system side effects out of the worktree. CORPUS_HARDENING_PLAN.md Phase 3 (delayed reliability fix).	2026-05-03 11:08:53 -04:00
Vijay Janapa Reddi	12032f700c	fix(vault-cli): audit_corpus_batched.py reliability fixes from canary Three bugs surfaced by the global-track canary run (2026-05-03, 20260503T123116Z), all fixed: 1. Gemini-CLI subprocess timeout was 240s; canary's average call took ~167s with 72K-char prompts occasionally exceeding 240s and getting killed mid-call. 60 questions (2 batches) returned no Gemini response. Bumped default timeout in _judges.call_gemini_judge() to 600s (≈3× typical, still triggers fast on a stuck call). 2. Resume logic in run_audit() treated ANY persisted row as "audited," including the placeholder rows for batches that errored. That meant re-running on the same output dir would skip the failed batches forever. Fixed: only rows with format_compliance != "error" are added to seen_qids, so a re-run retries the failures. 3. --output passed as a relative path crashed on `outdir.relative_to(REPO_ROOT)` because relative paths don't share the absolute REPO_ROOT prefix. Fixed: resolve outdir to absolute immediately after computing it. Validation: re-ran the canary on the same output dir with all three fixes. Resume correctly skipped the 9 good batches, retried the 2 errored batches, and both completed cleanly in 785s. All 313 global questions now have real Gemini verdicts (0 errors). Canary findings: format_compliance: 21 fails, 99.6% Gemini-vs-regex agreement level_fit: 48 fails (15.3% — the predicted level-inflation pattern; flagged for Phase 5 review) coherence: 18 fails math_correct: 8 fails title_quality: 16 placeholders (matches regex 1:1) CORPUS_HARDENING_PLAN.md Phase 4 (canary leg).	2026-05-03 09:18:30 -04:00
Vijay Janapa Reddi	dd71c66cae	feat(vault-cli): _judges.py + _batching.py — shared infra for batched audit Two new helper modules under interviews/vault-cli/scripts/. Used by the upcoming audit_corpus_batched.py (CORPUS_HARDENING_PLAN.md Phase 3) and extractable from the existing single-call scripts in a follow-up. _judges.py exports: - GEMINI_MODEL (pinned) - COMMON_MISTAKE_MARKERS (Pitfall/Rationale/Consequence) - NAPKIN_MATH_MARKERS (Assumptions/Calculations/Conclusion) - FAILURE_MODE_TAXONOMY (4-mode prose block: physical absurdity, vendor fabrication, mismatch, arithmetic) - call_gemini_judge() (subprocess wrapper + lenient JSON parse) - strip_fences() (response cleanup) - gate_format() (regex format-compliance gate, free) The taxonomy is the same prose block currently inlined in validate_drafts.py's COHERENCE_PROMPT and audit_chains_with_gemini.py's audit prompts. Centralizing it means a future failure-mode addition flows to every judge, not just one script. _batching.py exports: - MAX_PROMPT_CHARS = 320_000 (≈80K tokens, attention sweet spot) - DEFAULT_WRAPPER_CHARS (4K headroom for prompt scaffolding) - pack_batches[T]() (generic char-budgeted batcher with optional hard item cap) Generalized from audit_chains_with_gemini.py:batch_chains and build_chains_with_gemini.py:plan_batches. Properties documented in the docstring (preserves order, no items lost, oversized items still land in a batch). Followups: - migrate validate_drafts.py and audit_chains_with_gemini.py to use _judges.call_gemini_judge instead of their inlined wrappers (out of scope here; non-blocking for the audit work). CORPUS_HARDENING_PLAN.md Phase 3.	2026-05-03 08:22:39 -04:00

4 Commits