cs249r_book

github-starred/cs249r_book

Fork 0

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-07 18:18:42 -05:00

Commit Graph

Author	SHA1	Message	Date
Vijay Janapa Reddi	d2621cc9ed	feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc merge_audit_runs.py — merges multiple per-track audit_corpus_batched output dirs into one canonical run. Per-qid prefer non-error rows, then rows with suggested_corrections. AUDIT_FINDINGS_2026-05-03.md — first complete corpus audit. summarize_audit.py — truncate rationale snippets at word boundaries (was truncating mid-word, tripping codespell on words like 'claimin'). Phase 4 final stats (9,446 published questions audited): format_compliance: ~960 fail level_fit: ~1,580 fail coherence: ~480 fail math_correct: ~330 fail title_quality: ~250 placeholder + ~25 malformed 20 error rows in global to retry on next run 1,767 questions have suggested_corrections; ~1,500 more need a propose-fixes backfill pass (mostly cloud, some edge). CORPUS_HARDENING_PLAN.md Phase 4 finalization.	2026-05-03 14:26:37 -04:00
Vijay Janapa Reddi	3eaac3ca93	feat(vault-cli): summarize_audit.py — Phase 4 finalization helper Reads a 01_audit.json and emits a markdown triage doc with: - executive summary (per-gate pass/fail/error counts) - per-track failure rate matrix - coherence failure-mode breakdown (vendor fabrication / physical absurdity / mismatch / arithmetic) - priority lists for human review (math errors highest, then coherence by mode, then level inflation, then placeholder titles) - regex-vs-Gemini format-disagreement audit - recommended next-step actions per category Usage: python3 interviews/vault-cli/scripts/summarize_audit.py \\ --input <01_audit.json> \\ --output interviews/vault-cli/docs/AUDIT_FINDINGS_<date>.md Smoke-tested on the in-flight Phase 4 audit (2,880 cloud rows so far). Sample findings on the partial data: - 95 math errors caught (real ones — e.g. "200 * 4 * 168 = 134,400, not $168k/week") - 385 level_fit fails (13.4% — slightly higher than global's 15.3%) - 146 coherence fails: 61 mismatch / 47 arithmetic / 37 physical absurdity / 1 vendor fabrication - 118 placeholder titles in cloud (Gemini flags more than just the "Global New NNNN" pattern; Phase 7 scope is bigger than the plan estimated) CORPUS_HARDENING_PLAN.md Phase 4.	2026-05-03 11:06:26 -04:00

Author

SHA1

Message

Date

Vijay Janapa Reddi

d2621cc9ed

feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc

merge_audit_runs.py — merges multiple per-track audit_corpus_batched
output dirs into one canonical run. Per-qid prefer non-error rows,
then rows with suggested_corrections.

AUDIT_FINDINGS_2026-05-03.md — first complete corpus audit.

summarize_audit.py — truncate rationale snippets at word boundaries
(was truncating mid-word, tripping codespell on words like 'claimin').

Phase 4 final stats (9,446 published questions audited):
  format_compliance:   ~960 fail
  level_fit:          ~1,580 fail
  coherence:            ~480 fail
  math_correct:         ~330 fail
  title_quality:        ~250 placeholder + ~25 malformed
  20 error rows in global to retry on next run

1,767 questions have suggested_corrections; ~1,500 more need a
propose-fixes backfill pass (mostly cloud, some edge).

CORPUS_HARDENING_PLAN.md Phase 4 finalization.

2026-05-03 14:26:37 -04:00

Vijay Janapa Reddi

3eaac3ca93

feat(vault-cli): summarize_audit.py — Phase 4 finalization helper

Reads a 01_audit.json and emits a markdown triage doc with:
  - executive summary (per-gate pass/fail/error counts)
  - per-track failure rate matrix
  - coherence failure-mode breakdown (vendor fabrication / physical
    absurdity / mismatch / arithmetic)
  - priority lists for human review (math errors highest, then
    coherence by mode, then level inflation, then placeholder titles)
  - regex-vs-Gemini format-disagreement audit
  - recommended next-step actions per category

Usage:
  python3 interviews/vault-cli/scripts/summarize_audit.py \\
      --input <01_audit.json> \\
      --output interviews/vault-cli/docs/AUDIT_FINDINGS_<date>.md

Smoke-tested on the in-flight Phase 4 audit (2,880 cloud rows so far).
Sample findings on the partial data:
  - 95 math errors caught (real ones — e.g. "200 * 4 * 168 = 134,400,
    not $168k/week")
  - 385 level_fit fails (13.4% — slightly higher than global's 15.3%)
  - 146 coherence fails: 61 mismatch / 47 arithmetic / 37 physical
    absurdity / 1 vendor fabrication
  - 118 placeholder titles in cloud (Gemini flags more than just
    the "Global New NNNN" pattern; Phase 7 scope is bigger than the
    plan estimated)

CORPUS_HARDENING_PLAN.md Phase 4.

2026-05-03 11:06:26 -04:00

2 Commits