Phase 4 audit handoff — resume guide for the next session
Status as of 2026-05-03 (updated): Phases 0-4 + Phase 4 backfill +
Phase 8 (CLI + cron) complete. Ready for Phase 5 (interactive
review).
Branch: yaml-audit (97 commits ahead of origin/dev, 0 behind — merge into
local dev when ready)
Worktree: /Users/VJ/GitHub/MLSysBook-yaml-audit
Active workplan: interviews/vault-cli/docs/CORPUS_HARDENING_PLAN.md
Update appended 2026-05-03
The handoff doc below was written before the Phase 4 backfill ran. Key deltas to know about:
- Phase 4 backfill is done. Cloud + edge failures that were audit-only got
  --propose-fixes passes. The merged dataset now has 2,757 questions with
  suggested_corrections (up from 1,767), spanning all 5 tracks. 0 error rows
  (all retried).
- 6 cloud questions migrated. cloud-{0048,0273,0291,0336,0418,0454} had stray
  top-level options/correct_index (MCQ data) — moved into details: per the
  schema. Phase 6's Details extra='forbid' flip is now safe with no further
  corpus migrations.
- Phase 8 CLI subcommand shipped. vault audit run / review / summarize / merge
  wraps the underlying scripts. Cron workflow was already in place.
So the new session can skip Step 1 (backfill) in the doc below and go straight to Step 2 (Phase 5 interactive review).
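For quick reference, the shipped subcommands wrap the scripts one-to-one. The
flag spellings below mirror the wrapped scripts and the Step 5 spec later in
this doc; treat them as indicative and confirm with vault audit --help:
vault audit run --tracks cloud --propose-fixes          # wraps audit_corpus_batched.py
vault audit review <run-dir> --filter-gate math_correct # wraps apply_corrections.py
vault audit summarize <run-dir> --output <findings.md>  # wraps summarize_audit.py
vault audit merge --inputs <dirs...> --output <merged>  # wraps merge_audit_runs.py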
The merged audit dataset for Phase 5 is at:
interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json
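A quick sanity check on the merged file (same row structure as used in the
Step 1a snippet below):
python3 -c "
import json
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
rows = d['rows']
fixed = sum(1 for r in rows if r.get('suggested_corrections'))
print(len(rows), 'rows,', fixed, 'with suggested_corrections')
"  # expect 9,446 rows, 2,757 with suggested_corrections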
What's done
+ Phase 0 Cleanup deprecated scripts + dead-end audit_corpus.py ✅
+ Phase 1 Backfilled provenance: imported on 407 published YAMLs ✅
+ Phase 2 AUTHORING.md + vault new scaffold with format-marker stubs ✅
+ Phase 3 Built audit_corpus_batched.py + _judges.py + _batching.py ✅
+ Phase 4 Full-corpus audit run ✅ (9,446 / 9,446)
Phase 4 outputs (all in interviews/vault/_pipeline/runs/, gitignored):
full-corpus-20260503/ main run dir (cloud audit-only + edge mixed + global + 140 mobile)
full-corpus-20260503-mobile/ parallel mobile run (1,824 rows with fixes)
full-corpus-20260503-tinyml/ parallel tinyml run (1,202 rows with fixes)
full-corpus-20260503-merged/ ✦ canonical merged dataset — start here
The merged dataset has all 9,446 questions with 1,767 already carrying
suggested_corrections from the --propose-fixes invocations.
Phase 4 findings doc (committed):
interviews/vault-cli/docs/AUDIT_FINDINGS_2026-05-03.md
Final corpus state (Phase 4 results)
| gate | failures | rate |
|---|---|---|
| format_compliance | ~960 | 10.2% |
| level_fit | ~1,580 | 16.7% |
| coherence | ~480 | 5.1% |
| math_correct | ~330 | 3.5% |
| title_quality | ~250 placeholder + ~25 malformed | 2.9% |
| errors (need retry) | 20 (all global) | 0.2% |
Per-track failure rates: tinyml has the highest level-inflation rate (21.4%), cloud has the most absolute math errors, and edge has a higher coherence-fail rate than the other tracks (7%). See AUDIT_FINDINGS for details and qid lists.
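To reproduce per-track, per-gate numbers like these from the merged JSON (row
shape as in the Step 1a snippet below; the gate keys are assumed to be the
four named there plus title_quality):
python3 -c "
import json, collections
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
gates = ('format_compliance', 'level_fit', 'coherence', 'math_correct', 'title_quality')
fails = collections.Counter()
totals = collections.Counter()
for r in d['rows']:
    track = r['qid'].rsplit('-', 1)[0]   # e.g. 'cloud-0048' -> 'cloud'
    totals[track] += 1
    for g in gates:
        if r.get(g) == 'fail':
            fails[track, g] += 1
for (track, g), n in sorted(fails.items()):
    print(f'{track:8} {g:20} {n:5} ({n / totals[track]:.1%})')
"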
What's left
remaining:
Phase 4 finalization:
- retry 20 errored rows in global (1 invocation, ~1 batch)
- backfill --propose-fixes on cloud's ~1,344 failures (no fixes today)
- backfill --propose-fixes on edge's ~253 failures missing fixes
- re-merge into full-corpus-20260503-merged/
Phase 5 Walk ~3,300 corrections interactively ~6h human review
Phase 6 Schema tightening + lift format gate ~2h
Phase 7 Title-quality pass (~250 placeholders) ~30 calls + review
Phase 8 Add `vault audit` CLI subcommand ~30 min
Phase 9 Update paper.tech, vault publish 1.0.0, tag ~1h
Cron workflow (staffml-audit-corpus-monthly.yml) is already shipped.
How to resume
Step 0 — sanity check the worktree
cd /Users/VJ/GitHub/MLSysBook-yaml-audit
git status # should be clean
git log --oneline -10 # confirm Phase 0-4 commits present
git branch # * yaml-audit
vault check --strict # 10,711 loaded, 0 invariant failures
pytest interviews/vault-cli/tests/ -q # 84 passed
ruff check interviews/vault-cli # clean
Step 1 — finish Phase 4 backfill (~5 invocations of Gemini)
The cloud track was audited in audit-only mode during invocations 1-7 (yesterday), so its failures carry no suggested_corrections yet. To get them, run a propose-fixes pass against the failure-only subset.
Step 1a — list cloud failure qids
python3 -c "
import json
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
fails = [
r['qid'] for r in d['rows']
if r['qid'].startswith('cloud-')
and not r.get('suggested_corrections')
and any(r.get(g) == 'fail' for g in ('format_compliance','level_fit','coherence','math_correct'))
]
print(','.join(fails))
" > /tmp/cloud-fails-to-fix.txt
tr ',' '\n' < /tmp/cloud-fails-to-fix.txt | wc -l # ~1,344 qids (the file is one comma-separated line, so wc -w would report 1)
Step 1b — run propose-fixes on those qids
QIDS=$(cat /tmp/cloud-fails-to-fix.txt)
# Run multiple invocations until quota or qids exhausted.
# Each invocation does ~18 batches of 20 questions = ~360 qids.
# Cloud needs ~4 invocations to clear all failures.
for i in 1 2 3 4; do
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
--qids "$QIDS" \
--propose-fixes \
--workers 8 \
--max-calls 18 \
--output interviews/vault/_pipeline/runs/full-corpus-20260503-cloud-backfill
# The script auto-resumes — already-fixed qids skip on each iteration.
done
Step 1c — same for edge's 253 missing-fix failures
python3 -c "
import json
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
fails = [
r['qid'] for r in d['rows']
if r['qid'].startswith('edge-')
and not r.get('suggested_corrections')
and any(r.get(g) == 'fail' for g in ('format_compliance','level_fit','coherence','math_correct'))
]
print(','.join(fails))
" > /tmp/edge-fails-to-fix.txt
QIDS=$(cat /tmp/edge-fails-to-fix.txt)
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
--qids "$QIDS" \
--propose-fixes \
--workers 8 \
--max-calls 18 \
--output interviews/vault/_pipeline/runs/full-corpus-20260503-edge-backfill
Step 1d — retry the 20 errored global rows
These will resume automatically on any --tracks global run:
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
--tracks global \
--propose-fixes \
--workers 8 \
--max-calls 18 \
--output interviews/vault/_pipeline/runs/full-corpus-20260503 # main dir, will retry errors
Step 1e — re-merge
python3 interviews/vault-cli/scripts/merge_audit_runs.py \
--inputs interviews/vault/_pipeline/runs/full-corpus-20260503 \
interviews/vault/_pipeline/runs/full-corpus-20260503-mobile \
interviews/vault/_pipeline/runs/full-corpus-20260503-tinyml \
interviews/vault/_pipeline/runs/full-corpus-20260503-cloud-backfill \
interviews/vault/_pipeline/runs/full-corpus-20260503-edge-backfill \
--output interviews/vault/_pipeline/runs/full-corpus-20260503-merged
Step 1f — refresh the findings doc
python3 interviews/vault-cli/scripts/summarize_audit.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--output interviews/vault-cli/docs/AUDIT_FINDINGS_2026-05-04.md # date the new file
Commit the new findings doc; leave the 2026-05-03 one in place as a historical baseline.
Estimated Gemini cost for Step 1: ~80 calls (~30% of one day's quota). Estimated wall time: 1-2 hours of audit runs.
Step 2 — Phase 5: walk corrections interactively
After Step 1 there will be ~3,300 rows with suggested_corrections.
Use apply_corrections.py to walk them.
Step 2a — start with the safest auto-acceptable batch
Format-marker-only corrections are mechanical — auto-acceptable:
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate format_compliance \
--auto-accept-format
Expect this to clear ~600-800 format-only rows automatically.
Step 2b — math errors (highest priority, manual review)
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate math_correct \
--limit 50 # cap each session to ~50 to avoid review fatigue
Per CORPUS_HARDENING_PLAN.md §10 Q2, when math is wrong, accept napkin_math AND realistic_solution as a unit. Reject if Gemini's proposed fix changes meaning.
Step 2c — coherence failures by failure mode
- Vendor fabrication (1 instance: cloud-0560) — likely scenario rewrite.
- Physical absurdity (~70 instances) — usually a number adjustment.
- Scenario/solution mismatch (~80 instances) — review case-by-case.
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate coherence \
--limit 50
Step 2d — level-fit (relabel down per CORPUS_HARDENING_PLAN.md §10 Q3)
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate level_fit \
--limit 100
The default disposition is to relabel the question to the actual level. Reject if you want to rewrite the question UP (separate authoring task, not Phase 5).
Step 2e — placeholder titles (Phase 7)
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate title_quality
Validation after each apply session
vault check --strict # must stay green
pytest interviews/vault-cli/tests/ -q
git status # review the YAML diffs before committing
Commit each disposition session as one logical commit:
- fix(vault): format markers — N questions auto-accepted from Phase 5 review
- fix(vault): math errors — N questions reviewed (with notes for any non-trivial)
- etc.
Step 3 — Phase 6: schema tightening + lift format gate
Once the corpus is clean (format-failure rate ≈ 0), make the cleanliness load-bearing.
files to edit:
interviews/vault/schema/question_schema.yaml:
- Add pattern constraint to Details.common_mistake
pattern: '(?s).*\*\*The Pitfall:\*\*.*\*\*The Rationale:\*\*.*\*\*The Consequence:\*\*.*'
- Add pattern constraint to Details.napkin_math
pattern: '(?s).*\*\*Assumptions.*\*\*Calculations:\*\*.*\*\*Conclusion.*'
- Make provenance required: true (no default fallback at YAML load)
interviews/vault-cli/src/vault_cli/models.py:
- Flip Details `model_config = ConfigDict(extra="allow")` → `extra="forbid"`
(Pre-checked: 0 unknown extra keys across 9,446 published YAMLs)
- Add explicit attributes for any audit-stamp fields if needed
interviews/vault-cli/src/vault_cli/validator.py:
- Add _format_compliance() to structural_tier (lift gate_format from
validate_drafts.py into a published-corpus invariant)
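For orientation, here is a minimal sketch of the post-flip Details model and
the lifted gate, assuming pydantic v2, required string fields, and a
free-function gate. The real models.py/validator.py layout, field
optionality, and tier wiring will differ:
import re
from pydantic import BaseModel, ConfigDict, Field

COMMON_MISTAKE_RE = re.compile(
    r"(?s).*\*\*The Pitfall:\*\*.*\*\*The Rationale:\*\*.*\*\*The Consequence:\*\*.*"
)
NAPKIN_MATH_RE = re.compile(
    r"(?s).*\*\*Assumptions.*\*\*Calculations:\*\*.*\*\*Conclusion.*"
)

class Details(BaseModel):
    # The flip: unknown keys now hard-fail at load (pre-checked: 0 extras).
    model_config = ConfigDict(extra="forbid")

    common_mistake: str = Field(pattern=COMMON_MISTAKE_RE.pattern)
    napkin_math: str = Field(pattern=NAPKIN_MATH_RE.pattern)
    # ... remaining Details fields unchanged ...

def _format_compliance(details: Details) -> list[str]:
    # Published-corpus invariant lifted from validate_drafts.py's format gate.
    fails = []
    if not COMMON_MISTAKE_RE.match(details.common_mistake):
        fails.append("common_mistake: missing Pitfall/Rationale/Consequence markers")
    if not NAPKIN_MATH_RE.match(details.napkin_math):
        fails.append("napkin_math: missing Assumptions/Calculations/Conclusion markers")
    return fails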
run:
vault codegen # regenerate Pydantic / SQL DDL / TS types
pytest # add tests covering the new invariant
vault check --strict # 0 failures
Also fix 6 cloud questions with stray top-level options/correct_index
(cloud-0048, cloud-0273, cloud-0291, cloud-0336, cloud-0418, cloud-0454):
- Move those fields from top-level into details:
(Per the 2026-05-03 update above, this migration has already landed.)
Step 4 — Phase 7: any remaining title-quality fixes
After Phase 5's title-quality session, re-audit just the questions that were placeholders to verify the fixes landed:
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
--qids <the-original-placeholder-qids> \
--propose-fixes \
--output interviews/vault/_pipeline/runs/full-corpus-20260503-titles-verify
Expect title_quality to pass for all of them now.
Step 5 — Phase 8 second half: vault audit CLI subcommand
The cron workflow is already shipped. The CLI integration isn't yet.
add new file:
interviews/vault-cli/src/vault_cli/commands/audit.py
- vault audit run [--all|--tracks|--qids] [--propose-fixes] ...
→ wraps audit_corpus_batched.py
- vault audit review <run-dir> [--filter-gate ...]
→ wraps apply_corrections.py
- vault audit summarize <run-dir> [--output ...]
→ wraps summarize_audit.py
register in:
interviews/vault-cli/src/vault_cli/main.py
add: from vault_cli.commands import audit; audit.register(app)
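A thin-wrapper sketch of commands/audit.py, assuming a Typer app (which the
register(app) hook above suggests); it just shells out to the existing
scripts and forwards their exit codes:
import subprocess
import sys

import typer

SCRIPTS = "interviews/vault-cli/scripts"
audit_app = typer.Typer(help="Corpus audit: run / review / summarize")

def _delegate(script: str, args: list[str]) -> None:
    # Run the existing script and propagate its exit code.
    raise typer.Exit(subprocess.call([sys.executable, f"{SCRIPTS}/{script}", *args]))

@audit_app.command()
def run(tracks: str = "", qids: str = "", propose_fixes: bool = False, output: str = ""):
    args: list[str] = []
    if tracks:
        args += ["--tracks", tracks]
    if qids:
        args += ["--qids", qids]
    if propose_fixes:
        args.append("--propose-fixes")
    if output:
        args += ["--output", output]
    _delegate("audit_corpus_batched.py", args)

@audit_app.command()
def review(run_dir: str, filter_gate: str = ""):
    args = ["--input", f"{run_dir}/01_audit.json"]
    if filter_gate:
        args += ["--filter-gate", filter_gate]
    _delegate("apply_corrections.py", args)

@audit_app.command()
def summarize(run_dir: str, output: str = ""):
    args = ["--input", f"{run_dir}/01_audit.json"]
    if output:
        args += ["--output", output]
    _delegate("summarize_audit.py", args)

def register(app: typer.Typer) -> None:
    app.add_typer(audit_app, name="audit")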
Step 6 — Phase 9: paper update + release
# Update paper.tech with post-audit corpus stats:
# - 9,446 published, audit pass rates per gate, per-track tables
# - Methodology paragraph naming gemini-3.1-pro-preview as audit model
# - Citation of audit_corpus_batched.py + AUDIT_FINDINGS_<date>.md
vault export-paper
vault build --local-json # release_hash should roll
vault publish 1.0.0
vault verify 1.0.0 --git-ref v1.0.0 # citation-grade round-trip
git tag vault-1.0.0
Tooling reference
All scripts live under interviews/vault-cli/scripts/:
| Script | What it does |
|---|---|
| audit_corpus_batched.py | Batched corpus audit. --workers N --propose-fixes |
| apply_corrections.py | Interactive accept/reject for proposed corrections |
| summarize_audit.py | Generate AUDIT_FINDINGS markdown from 01_audit.json |
| merge_audit_runs.py | Combine multiple per-track output dirs |
| backfill_provenance.py | Phase 1 helper (already run) |
| _judges.py | Shared prompts + Gemini-call wrapper |
| _batching.py | Generic char-budgeted batcher |
| validate_drafts.py | Single-draft multi-gate (per-question) |
| audit_math.py | Single-question math spot-check |
| audit_chains_with_gemini.py | Chain-level audit (existing, separate concern) |
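The one non-obvious item above is the char-budgeted batcher. The idea, as a
standalone sketch (the real _batching.py API and the budget value here are
assumptions), is greedy packing of items into batches whose total serialized
size stays under a character cap:
def batch_by_chars(items, budget=40_000, size=len):
    # Greedily pack items; start a new batch when the next item would
    # exceed the budget. A single oversized item still gets its own batch.
    batches, current, used = [], [], 0
    for item in items:
        n = size(item)
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += n
    if current:
        batches.append(current)
    return batches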
Preserved-for-adaptation in interviews/vault/scripts/ (see DEPRECATED.md):
gemini_backfill_question.py, gpt_backfill_question.py,
gemini_cli_generate_questions.py, generate.py, gemini_fix_errors.py,
deep_verify.py
Open questions that may still need answers
From CORPUS_HARDENING_PLAN.md §10:
extra="forbid"onQuestion(recommended: keep lenient on Question, strict on Details — needed for Phase 6)- Cron cadence (recommended: monthly — already implemented)
- Per-track audit floor (recommended: introduce post-Phase-5 cleanup)
audit_math.pydeprecation timing (recommended: keep one quarter)- AUTHORING.md maintenance hook (recommended: pre-commit field-name check)
- Sample size for cron (recommended: full monthly — already configured)
Q2 + Q3 already answered (math errors as a unit; level relabel down).
Commit log highlights from this session
d2621cc9e feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc
2d9330da6 fix(vault-cli): isolate gemini CLI scratch files in temp dir
e7a2a27bf feat(ci): staffml-audit-corpus-monthly.yml — recurring corpus audit workflow
3eaac3ca9 feat(vault-cli): summarize_audit.py — Phase 4 finalization helper
1722133fa feat(vault-cli): apply_corrections.py — interactive accept/reject
1b58a9c50 feat(vault-cli): parallel audit_corpus_batched.py with submit-stagger
12032f700 fix(vault-cli): audit_corpus_batched.py reliability fixes from canary
03031dc38 test(vault-cli): smoke tests for audit_corpus_batched batching
69cf6f0a5 feat(vault-cli): audit_corpus_batched.py — full-corpus batched audit
dd71c66ca feat(vault-cli): _judges.py + _batching.py — shared infra
f691d6c14 feat(vault-cli): vault new scaffolds full Pitfall/Rationale/Consequence stubs
7500b9281 docs(vault): AUTHORING.md — single-source authoring reference
e8f0faa83 chore(vault): explicit provenance: imported on 407 published questions
39d567f26 feat(vault-cli): backfill_provenance.py — Phase 1 helper
3f0773706 chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation
56d3ed155 chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0
36f2ef592 docs(vault-cli): CORPUS_HARDENING_PLAN.md — supersedes RELEASE_AUDIT_PLAN.md
Estimated remaining time to ship 1.0.0
Step 1 (Phase 4 backfill): ~2h Gemini (~80 calls, ~30% of one day's quota)
Step 2 (Phase 5 review): ~6h human (the big sink)
Step 3 (Phase 6 schema): ~2h
Step 4 (Phase 7 verify): ~30 min
Step 5 (Phase 8 CLI subcommand): ~30 min
Step 6 (Phase 9 release): ~1h
─────────────────────────────────────────────
Total ~12h, mostly Phase 5 review
Spread over 2-3 working days.
Troubleshooting
Q: Gemini calls fail with rate-limit errors.
A: Drop --workers to 4 and re-run. The script's resume picks up where it left off.
Q: vault check --strict fails after applying corrections.
A: A correction's edited YAML failed Pydantic. The script logs the qid;
investigate the diff. apply_corrections.py validates BEFORE writing,
so this only happens if the corpus had a stale validation issue.
Q: pre-commit hook codespell rejects a finding doc.
A: summarize_audit.py:truncate_words already addresses mid-word
truncations. If new typos slip in, lengthen the truncation budget or
add a codespell baseline allowlist entry.
Q: Worktree shows untracked gemini scratch files.
A: My fix in 2d9330da6 isolates new gemini CLI scratch files in a temp dir.
Old scratch from before that commit lingers — safe to delete with /bin/rm
(use the full path; plain rm is aliased to trash).