Ran audit_chains_with_gemini.py end-to-end: 18 Gemini-3.1-pro-preview
calls (well under the 250/day cap), with prompts sized at 80-336K chars
to target the attention sweet spot of ~80-100K input tokens. Per-call
traces are under interviews/vault/audit-runs/20260501T213817Z/; the
rollup is at interviews/vault/audit-runs/AUDIT_REPORT.md.
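The sizing discipline is simple enough to sketch. A minimal
illustration, assuming chains arrive as JSON-serializable dicts and a
call_gemini client wrapper exists; pack_prompts, run_audit, and
call_gemini are hypothetical names, not the script's real interface:

```python
# Sketch of the call-budget loop; all names here are illustrative,
# not audit_chains_with_gemini.py's actual API.
import json

CHAR_BUDGET = 336_000   # upper end of this run's prompt sizes
DAILY_CALL_CAP = 250    # Gemini daily quota cited above

def pack_prompts(chains, char_budget=CHAR_BUDGET):
    """Greedily pack serialized chains into prompt-sized batches."""
    batches, current, size = [], [], 0
    for chain in chains:
        blob = json.dumps(chain, ensure_ascii=False)
        if current and size + len(blob) > char_budget:
            batches.append(current)
            current, size = [], 0
        current.append(blob)
        size += len(blob)
    if current:
        batches.append(current)
    return batches

def run_audit(chains, calls_used_today, call_gemini):
    """Abort before the daily cap, then issue one call per batch."""
    batches = pack_prompts(chains)
    if calls_used_today + len(batches) > DAILY_CALL_CAP:
        raise RuntimeError("batch plan would exceed the daily call cap")
    return [call_gemini("\n".join(batch)) for batch in batches]
```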
Three critical findings the pipeline's own gates missed:
1. Δ=0 chains are ~98% bad: 54/55 judged "bad" and 54/55 judged
   "shared_scenario_for_d0_pair: no". The lenient prompt's constraint
   that Δ=0 fire only for shared-scenario pairs did not bind in
   practice; 6% of chains.json is affected. (A hard-gate sketch
   follows this list.)
2. Gap detection is ~50% noise: 21 of 40 sampled gaps were judged
   "hallucinated", i.e. their anchors don't share a scenario thread.
   Phase 3 generation should pre-filter gaps before issuing the call
   (the same sketch below covers this filter).
3. The pilot draft pass rate was inflated by validate_drafts.py's LLM
   judges. Audit dispositions:
   - mobile-2147: accept
   - edge-2536: edit (scenario truncation)
   - edge-2537: REJECT (cognitive load too low for L3)
   - mobile-2146: REJECT (physically absurd 0.5s/4W NPU wake-up)
   (A plausibility-backstop sketch follows the hard-gate sketch.)
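Findings 1 and 2 point at the same fix: enforce the shared-scenario
constraint in code rather than in the prompt. A hedged sketch; the
delta, scenario_id, and anchors field names are assumptions about
chains.json's schema, not confirmed:

```python
# Hard gates for findings 1 and 2. Field names (delta, scenario_id,
# anchors) are guesses at the chains.json schema; swap in the real
# keys before wiring this into the pipeline.

def shares_scenario(a, b):
    """True when both anchors carry the same non-empty scenario thread."""
    return bool(a.get("scenario_id")) and a["scenario_id"] == b.get("scenario_id")

def gate_delta_zero(chains):
    """Drop delta=0 chains whose anchors don't share a scenario,
    enforcing as code the constraint the lenient prompt failed to
    bind (54/55 sampled delta=0 chains fail this check)."""
    return [
        c for c in chains
        if c["delta"] != 0 or shares_scenario(*c["anchors"])
    ]

def prefilter_gaps(gaps):
    """Drop gaps whose anchors lack a shared scenario thread before
    Phase 3 spends a generation call on them (~50% of sampled gaps
    were judged hallucinated)."""
    return [g for g in gaps if shares_scenario(*g["anchors"])]
```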
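For finding 3, a deterministic plausibility backstop would have caught
mobile-2146 before the LLM judges saw it. The envelopes below are
illustrative placeholders, not vetted hardware numbers; calibrate them
before trusting the verdicts:

```python
# Spec sanity check to backstop validate_drafts.py's LLM judges.
# PLAUSIBLE_RANGES is a placeholder envelope table (assumption),
# not a vetted datasheet; tune per device class.

PLAUSIBLE_RANGES = {
    "npu_wake_s":  (0.001, 0.1),  # illustrative wake-latency bounds
    "npu_power_w": (0.1, 2.0),    # illustrative mobile-NPU power bounds
}

def implausible_fields(specs):
    """Return (field, value) pairs that fall outside their envelope."""
    return [
        (field, specs[field])
        for field, (lo, hi) in PLAUSIBLE_RANGES.items()
        if field in specs and not lo <= specs[field] <= hi
    ]

# mobile-2146's 0.5s/4W wake-up trips both checks:
assert implausible_fields({"npu_wake_s": 0.5, "npu_power_w": 4.0}) == [
    ("npu_wake_s", 0.5),
    ("npu_power_w", 4.0),
]
```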
Calibration findings:
- Primary chains (n=100): 64% good, 22% weak, 14% bad
- Secondary chains (n=100): 61% good, 33% weak, 6% bad
- The tier delta vs primary is small at "good" (61% vs 64%); the
  actual quality cliff in secondary is concentrated in the Δ=0 subset.
No autonomous fixes were filed; per agreement, the audit produces
findings only. The CHAIN_ROADMAP.md Progress Log spells out the three
concrete decisions for next session: drop/demote/rebuild the Δ=0
chains, pre-filter gaps, and disposition the 4 drafts per
AUDIT_REPORT.md.
Total Gemini calls this session: 55 (Phase 1.4 + Phase 3 pilot + audit).