cs249r_book/interviews/vault-cli/docs
Vijay Janapa Reddi b68f6dbf83 audit(vault): independent Gemini audit — 18 calls, 3 critical findings
Ran audit_chains_with_gemini.py end-to-end. 18 Gemini-3.1-pro-preview
calls (well under the 250/day cap) sized to 80-336K char prompts (the
attention sweet spot at ~80-100K input tokens). Per-call traces under
interviews/vault/audit-runs/20260501T213817Z/, rollup at
interviews/vault/audit-runs/AUDIT_REPORT.md.

Three critical findings the pipeline's own gates missed:

  1. Δ=0 chains are ~98% bad (54/55 judged "bad", 54/55 judged
     "shared_scenario_for_d0_pair: no"). The lenient prompt's
     constraint that Δ=0 only fire for shared-scenario pairs didn't
     bind in practice. 6% of chains.json is affected.

  2. Gap detection is ~50% noise. 21 of 40 sampled gaps judged
     "hallucinated" — anchors don't share a scenario thread. Phase 3
     generation should pre-filter gaps before issuing the call.

  3. Pilot draft pass rate was inflated by validate_drafts.py's LLM
     judges. Independent audit verdicts on the 4 pilot drafts:
       mobile-2147  accept
       edge-2536    edit (scenario truncation)
       edge-2537    REJECT (cognitive load too low for L3)
       mobile-2146  REJECT (physically absurd 0.5s/4W NPU wake-up)
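
A minimal sketch of the pre-filter suggested in finding 2, assuming each
gap carries the scenario text of its two anchors. The Gap record, its
field names, the word-overlap heuristic, and the 0.2 threshold are all
hypothetical illustrations, not the pipeline's actual schema or method:

```python
from dataclasses import dataclass


@dataclass
class Gap:
    """Hypothetical gap record: an id plus the scenario text of both anchors."""
    gap_id: str
    anchor_a_scenario: str
    anchor_b_scenario: str


def scenario_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets.

    A crude, assumed proxy for "anchors share a scenario thread".
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def prefilter_gaps(gaps, min_overlap=0.2):
    """Keep gaps whose anchors plausibly share a scenario.

    The threshold is an assumption and would need tuning against
    audited labels before use.
    """
    return [
        g for g in gaps
        if scenario_overlap(g.anchor_a_scenario, g.anchor_b_scenario) >= min_overlap
    ]
```

Only the gaps returned by prefilter_gaps would go on to the Phase 3
generation call. A cheap lexical check like this will not catch every
hallucinated gap, but it costs no LLM calls.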

Calibration findings:
  - Primary chains (n=100): 64% good, 22% weak, 14% bad
  - Secondary chains (n=100): 61% good, 33% weak, 6% bad
  - The primary/secondary gap at "good" is small (64% vs 61%); the
    actual quality cliff in secondary is concentrated in the Δ=0
    subset.
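
The headline rates above can be rechecked from the raw counts. The
sketch below recomputes them and adds a 95% Wilson score interval to
show how much sampling noise each sample size allows. The interval is an
addition for illustration and is not part of the audit script:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts taken from the findings above.
counts = {
    "d0_chains_bad": (54, 55),
    "gaps_hallucinated": (21, 40),
    "primary_good": (64, 100),
    "secondary_good": (61, 100),
}

for name, (k, n) in counts.items():
    low, high = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.1%} (interval {low:.1%} to {high:.1%})")
```

At n=100 per tier, the intervals for 64% and 61% overlap widely, which
is consistent with treating the tier delta at "good" as small.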

No autonomous fixes filed — per agreement, audit produces findings
only. CHAIN_ROADMAP.md Progress Log spells out the three concrete
decisions for next session (drop / demote / rebuild Δ=0; pre-filter
gaps; disposition the 4 drafts per AUDIT_REPORT.md).

Total Gemini calls this session: 55 (Phase 1.4 + Phase 3 pilot + audit).
2026-05-01 18:04:36 -04:00