Mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-05-06 17:49:07 -05:00
Replaces the dead-end audit_corpus.py (deleted in Phase 0). The new
design batches 30-40 questions per Gemini call instead of one question
per gate, cutting the corpus-audit cost roughly 10×.
Per call, ONE prompt asks Gemini for a JSON array of per-question
verdicts across:
- format_compliance: pass/fail (regex-checkable; cross-checked
against host-side gate_format)
- level_fit: pass/fail/skip + rationale (level inflation
+ verb mismatch + "no real judgement required")
- coherence: pass/fail + failure_mode (physical_absurdity /
vendor_fabrication / mismatch / arithmetic)
- math_correct: pass/fail/no_math + specific errors
- title_quality: good/placeholder/malformed
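The per-question verdict schema above can be sketched as a validator for
one batch reply. This is a hypothetical sketch (function and field names
are assumptions from this message, not the script's actual code); it
flags rows with out-of-vocabulary verdicts instead of dropping them, so
one malformed row doesn't sink the batch:

```python
import json

# Allowed verdict values per dimension, mirroring the list above.
VERDICT_FIELDS = {
    "format_compliance": {"pass", "fail"},
    "level_fit": {"pass", "fail", "skip"},
    "coherence": {"pass", "fail"},
    "math_correct": {"pass", "fail", "no_math"},
    "title_quality": {"good", "placeholder", "malformed"},
}

def parse_batch_reply(reply_text: str) -> list[dict]:
    """Parse one Gemini reply: a JSON array of per-question verdicts.

    Each row gets an ``_invalid`` list naming any dimension whose value
    is missing or outside the allowed set.
    """
    rows = json.loads(reply_text)
    for row in rows:
        row["_invalid"] = [
            field for field, allowed in VERDICT_FIELDS.items()
            if row.get(field) not in allowed
        ]
    return rows
```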
Cost (full corpus, 9,446 published):
- audit-only: ~315 calls (1.3 days at the 250/day cap)
- --propose-fixes: ~+50% (denser per-batch output → smaller batches)
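The audit-only arithmetic checks out as back-of-envelope Python (batch
size 30 and the 250-calls/day quota are the figures stated above):

```python
import math

CORPUS = 9_446     # published questions
BATCH = 30         # questions per Gemini call
DAILY_CAP = 250    # calls/day quota

calls = math.ceil(CORPUS / BATCH)  # -> 315 calls
days = calls / DAILY_CAP           # -> ~1.26, i.e. the "1.3 days" above
```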
Modes:
--all full corpus (default)
--tracks cloud,edge track filter
--qids X,Y,Z explicit qid set
--propose-fixes ALSO ask Gemini to propose corrections
(per CORPUS_HARDENING_PLAN.md §10:
- math errors: rewrite napkin_math AND
realistic_solution as a UNIT
- level inflation: relabel DOWN, never
attempt to rewrite the question up)
--max-calls N cap per invocation; resume by re-running
--batch-size N tuning override
--dry-run plan without calling Gemini
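The flag table above maps onto a CLI surface roughly like the following
sketch (flag names come from this message; defaults, help strings, and
the mutually-exclusive scope group are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Batched Gemini corpus audit")
    # Scope selection: --all / --tracks / --qids are alternatives.
    scope = p.add_mutually_exclusive_group()
    scope.add_argument("--all", action="store_true",
                       help="full corpus (default)")
    scope.add_argument("--tracks",
                       help="comma-separated track filter, e.g. cloud,edge")
    scope.add_argument("--qids",
                       help="comma-separated explicit qid set")
    p.add_argument("--propose-fixes", action="store_true",
                   help="ALSO ask Gemini to propose corrections")
    p.add_argument("--max-calls", type=int, default=None,
                   help="cap per invocation; resume by re-running")
    p.add_argument("--batch-size", type=int, default=30,
                   help="tuning override")
    p.add_argument("--dry-run", action="store_true",
                   help="plan without calling Gemini")
    return p
```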
Output convention: _pipeline/runs/<UTC-timestamp>/
00_config.json — flags, model, candidate count
01_audit.json — per-question rows (resumable; rewritten after
each batch so a Ctrl-C / timeout doesn't lose work)
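The "rewritten after each batch" convention can be sketched with an
atomic write-then-replace, so a Ctrl-C mid-write can never leave a
half-written 01_audit.json, and a re-run reloads the finished rows
(helper names here are illustrative, not the script's actual API):

```python
import json
import os
import tempfile

def checkpoint(rows: list[dict], path: str) -> None:
    """Rewrite the audit file after each batch, atomically."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f, indent=2)
    os.replace(tmp, path)  # atomic rename; old file stays intact on crash

def load_done_qids(path: str) -> set[str]:
    """On resume, skip qids that already have verdicts."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {row["qid"] for row in json.load(f)}
```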
Sanity check: dry-run on full corpus packs 9,446 questions into 315
batches of 30, with payloads 55-69KB each (well under the 320KB
attention sweet spot for gemini-3.1-pro-preview).
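The dry-run packing check amounts to greedy, size-bounded batching; a
minimal sketch, assuming a 30-question batch cap and the 320KB payload
ceiling mentioned above (the real script's packing logic may differ):

```python
MAX_PAYLOAD = 320 * 1024  # bytes; stated ceiling per call

def pack(questions: list[str], batch_size: int = 30) -> list[list[str]]:
    """Greedily pack question payloads into count- and size-bounded batches."""
    batches, cur, cur_bytes = [], [], 0
    for q in questions:
        q_bytes = len(q.encode())
        if cur and (len(cur) >= batch_size or cur_bytes + q_bytes > MAX_PAYLOAD):
            batches.append(cur)
            cur, cur_bytes = [], 0
        cur.append(q)
        cur_bytes += q_bytes
    if cur:
        batches.append(cur)
    return batches
```

With 9,446 uniform questions this yields 315 batches (314 full batches
of 30 plus a final batch of 26), matching the dry-run count above.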
CORPUS_HARDENING_PLAN.md Phase 3.