Mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-05-06 17:49:07 -05:00
Replaces the dead-end audit_corpus.py (deleted in Phase 0). The new
design batches 30-40 questions per Gemini call instead of one question
per gate, cutting the corpus-audit cost roughly 10×.
Per call, ONE prompt asks Gemini for a JSON array of per-question
verdicts across:
- format_compliance: pass/fail (regex-checkable; cross-checked
against host-side gate_format)
- level_fit: pass/fail/skip + rationale (level inflation
+ verb mismatch + "no real judgement required")
- coherence: pass/fail + failure_mode (physical_absurdity /
vendor_fabrication / mismatch / arithmetic)
- math_correct: pass/fail/no_math + specific errors
- title_quality: good/placeholder/malformed
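The per-question verdict schema above can be sketched as a validator for
one batch reply. This is a hypothetical sketch (function and field names
are assumptions from this message, not the script's actual code); it
flags rows with out-of-vocabulary verdicts instead of dropping them, so
one malformed row doesn't sink the batch:

```python
import json

# Allowed verdict values per dimension, mirroring the list above.
VERDICT_FIELDS = {
    "format_compliance": {"pass", "fail"},
    "level_fit": {"pass", "fail", "skip"},
    "coherence": {"pass", "fail"},
    "math_correct": {"pass", "fail", "no_math"},
    "title_quality": {"good", "placeholder", "malformed"},
}

def parse_batch_reply(reply_text: str) -> list[dict]:
    """Parse one Gemini reply: a JSON array of per-question verdicts.

    Each row gets an ``_invalid`` list naming any dimension whose value
    is missing or outside the allowed set.
    """
    rows = json.loads(reply_text)
    for row in rows:
        row["_invalid"] = [
            field for field, allowed in VERDICT_FIELDS.items()
            if row.get(field) not in allowed
        ]
    return rows
```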
Cost (full corpus, 9,446 published):
- audit-only: ~315 calls (1.3 days at the 250/day cap)
- --propose-fixes: ~+50% (denser per-batch output → smaller batches)
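The audit-only arithmetic checks out as back-of-envelope Python (batch
size 30 and the 250-calls/day quota are the figures stated above):

```python
import math

CORPUS = 9_446     # published questions
BATCH = 30         # questions per Gemini call
DAILY_CAP = 250    # calls/day quota

calls = math.ceil(CORPUS / BATCH)  # -> 315 calls
days = calls / DAILY_CAP           # -> ~1.26, i.e. the "1.3 days" above
```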
Modes:
--all full corpus (default)
--tracks cloud,edge track filter
--qids X,Y,Z explicit qid set
--propose-fixes ALSO ask Gemini to propose corrections
(per CORPUS_HARDENING_PLAN.md §10:
- math errors: rewrite napkin_math AND
realistic_solution as a UNIT
- level inflation: relabel DOWN, never
attempt to rewrite the question up)
--max-calls N cap per invocation; resume by re-running
--batch-size N tuning override
--dry-run plan without calling Gemini
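The flag table above maps onto a CLI surface roughly like the following
sketch (flag names come from this message; defaults, help strings, and
the mutually-exclusive scope group are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Batched Gemini corpus audit")
    # Scope selection: --all / --tracks / --qids are alternatives.
    scope = p.add_mutually_exclusive_group()
    scope.add_argument("--all", action="store_true",
                       help="full corpus (default)")
    scope.add_argument("--tracks",
                       help="comma-separated track filter, e.g. cloud,edge")
    scope.add_argument("--qids",
                       help="comma-separated explicit qid set")
    p.add_argument("--propose-fixes", action="store_true",
                   help="ALSO ask Gemini to propose corrections")
    p.add_argument("--max-calls", type=int, default=None,
                   help="cap per invocation; resume by re-running")
    p.add_argument("--batch-size", type=int, default=30,
                   help="tuning override")
    p.add_argument("--dry-run", action="store_true",
                   help="plan without calling Gemini")
    return p
```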
Output convention: _pipeline/runs/<UTC-timestamp>/
00_config.json — flags, model, candidate count
01_audit.json — per-question rows (resumable; rewritten after
each batch so a Ctrl-C / timeout doesn't lose work)
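The "rewritten after each batch" convention can be sketched with an
atomic write-then-replace, so a Ctrl-C mid-write can never leave a
half-written 01_audit.json, and a re-run reloads the finished rows
(helper names here are illustrative, not the script's actual API):

```python
import json
import os
import tempfile

def checkpoint(rows: list[dict], path: str) -> None:
    """Rewrite the audit file after each batch, atomically."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f, indent=2)
    os.replace(tmp, path)  # atomic rename; old file stays intact on crash

def load_done_qids(path: str) -> set[str]:
    """On resume, skip qids that already have verdicts."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {row["qid"] for row in json.load(f)}
```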
Sanity check: dry-run on full corpus packs 9,446 questions into 315
batches of 30, with payloads 55-69KB each (well under the 320KB
attention sweet spot for gemini-3.1-pro-preview).
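The dry-run packing check amounts to greedy, size-bounded batching; a
minimal sketch, assuming a 30-question batch cap and the 320KB payload
ceiling mentioned above (the real script's packing logic may differ):

```python
MAX_PAYLOAD = 320 * 1024  # bytes; stated ceiling per call

def pack(questions: list[str], batch_size: int = 30) -> list[list[str]]:
    """Greedily pack question payloads into count- and size-bounded batches."""
    batches, cur, cur_bytes = [], [], 0
    for q in questions:
        q_bytes = len(q.encode())
        if cur and (len(cur) >= batch_size or cur_bytes + q_bytes > MAX_PAYLOAD):
            batches.append(cur)
            cur, cur_bytes = [], 0
        cur.append(q)
        cur_bytes += q_bytes
    if cur:
        batches.append(cur)
    return batches
```

With 9,446 uniform questions this yields 315 batches (314 full batches
of 30 plus a final batch of 26), matching the dry-run count above.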
CORPUS_HARDENING_PLAN.md Phase 3.