Phase 4 audit handoff — resume guide for the next session

Status as of 2026-05-03 (updated): Phases 0-4, the Phase 4 backfill, and Phase 8 (CLI + cron) are complete. Ready for Phase 5 (interactive review).
Branch: yaml-audit (97 commits ahead of origin/dev, 0 behind; merge into local dev when ready)
Worktree: /Users/VJ/GitHub/MLSysBook-yaml-audit
Active workplan: interviews/vault-cli/docs/CORPUS_HARDENING_PLAN.md

Update appended 2026-05-03

The handoff doc below was written before the Phase 4 backfill ran. Key deltas to know about:

  • Phase 4 backfill is done. Cloud + edge failures that were audit-only got --propose-fixes passes. The merged dataset now has 2,757 questions with suggested_corrections (up from 1,767), spanning all 5 tracks. 0 error rows (all retried).
  • 6 cloud questions migrated. cloud-{0048,0273,0291,0336,0418,0454} had stray top-level options/correct_index (MCQ data) — moved into details: per the schema. Phase 6's Details extra='forbid' flip is now safe with no further corpus migrations.
  • Phase 8 CLI subcommand shipped. vault audit run / review / summarize / merge wraps the underlying scripts. Cron workflow was already in place.

So the new session can skip Step 1 (backfill) in the doc below and go straight to Step 2 (Phase 5 interactive review).

The merged audit dataset for Phase 5 is at: interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json
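A quick way to confirm the post-backfill state before starting review (a sketch; it assumes the row shape used by the Step 1 snippets below, where each row carries qid and, when fixes were proposed, suggested_corrections):

import json

# Count rows in the canonical merged dataset that carry proposed fixes.
path = "interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json"
rows = json.load(open(path))["rows"]
with_fixes = sum(1 for r in rows if r.get("suggested_corrections"))
print(f"{with_fixes} of {len(rows)} rows carry suggested_corrections")  # expect 2,757 / 9,446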



What's done

+ Phase 0  Cleanup deprecated scripts + dead-end audit_corpus.py     ✅
+ Phase 1  Backfilled provenance: imported on 407 published YAMLs    ✅
+ Phase 2  AUTHORING.md + vault new scaffold with format-marker stubs ✅
+ Phase 3  Built audit_corpus_batched.py + _judges.py + _batching.py  ✅
+ Phase 4  Full-corpus audit run                                      ✅ (9,446 / 9,446)

Phase 4 outputs (all in interviews/vault/_pipeline/runs/, gitignored):

full-corpus-20260503/         main run dir (cloud audit-only + edge mixed + global + 140 mobile)
full-corpus-20260503-mobile/  parallel mobile run (1,824 rows with fixes)
full-corpus-20260503-tinyml/  parallel tinyml run (1,202 rows with fixes)
full-corpus-20260503-merged/  ✦ canonical merged dataset — start here

The merged dataset has all 9,446 questions with 1,767 already carrying suggested_corrections from the --propose-fixes invocations.

Phase 4 findings doc (committed): interviews/vault-cli/docs/AUDIT_FINDINGS_2026-05-03.md


Final corpus state (Phase 4 results)

gate                 fails                              fail rate
format_compliance    ~960                               10.2%
level_fit            ~1,580                             16.7%
coherence            ~480                               5.1%
math_correct         ~330                               3.5%
title_quality        ~250 placeholder + ~25 malformed   2.9%
errors (need retry)  20 (all global)                    0.2%

Per-track failure rates: tinyml has the highest level-inflation rate (21.4%); cloud has the most math errors in absolute terms; edge has a higher coherence-fail rate (7%) than the other tracks. See AUDIT_FINDINGS for details and qid lists.
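The per-track breakdown can be regenerated from the merged dataset (a sketch; assumes the row shape used by the Step 1 snippets below, with each gate's verdict stored as a "pass"/"fail" field on the row):

import json
from collections import Counter

GATES = ("format_compliance", "level_fit", "coherence", "math_correct", "title_quality")
path = "interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json"
rows = json.load(open(path))["rows"]

totals, fails = Counter(), Counter()
for r in rows:
    track = r["qid"].split("-")[0]      # "cloud-0048" -> "cloud"
    totals[track] += 1
    for gate in GATES:
        if r.get(gate) == "fail":
            fails[(track, gate)] += 1

for track in sorted(totals):
    for gate in GATES:
        n = fails[(track, gate)]
        print(f"{track:8} {gate:18} {n:5}  {n / totals[track]:.1%}")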


What's left

remaining:
  Phase 4 finalization:
    - retry 20 errored rows in global (1 invocation, ~1 batch)
    - backfill --propose-fixes on cloud's ~1,344 failures (which carry no fixes yet)
    - backfill --propose-fixes on edge's ~253 failures that are missing fixes
    - re-merge into full-corpus-20260503-merged/

  Phase 5  Walk ~3,300 corrections interactively                  ~6h human review
  Phase 6  Schema tightening + lift format gate                   ~2h
  Phase 7  Title-quality pass (~250 placeholders)                 ~30 calls + review
  Phase 8  Add `vault audit` CLI subcommand                       ~30 min
  Phase 9  Update paper.tex, vault publish 1.0.0, tag              ~1h

Cron workflow (staffml-audit-corpus-monthly.yml) is already shipped.


How to resume

Step 0 — sanity check the worktree

cd /Users/VJ/GitHub/MLSysBook-yaml-audit
git status                       # should be clean
git log --oneline -10            # confirm Phase 0-4 commits present
git branch                       # * yaml-audit
vault check --strict             # 10,711 loaded, 0 invariant failures
pytest interviews/vault-cli/tests/ -q  # 84 passed
ruff check interviews/vault-cli  # clean

Step 1 — finish Phase 4 backfill (~5 invocations of Gemini)

The cloud track was audited audit-only (no --propose-fixes) in invocations 1-7 (yesterday). To get suggested_corrections for cloud's failures, run a --propose-fixes pass against the failure-only subset.

Step 1a — list cloud failure qids

python3 -c "
import json
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
fails = [
    r['qid'] for r in d['rows']
    if r['qid'].startswith('cloud-')
    and not r.get('suggested_corrections')
    and any(r.get(g) == 'fail' for g in ('format_compliance','level_fit','coherence','math_correct'))
]
print(','.join(fails))
" > /tmp/cloud-fails-to-fix.txt

tr ',' '\n' < /tmp/cloud-fails-to-fix.txt | wc -l   # ~1,344 qids (the file is one comma-separated line, so wc -w would report 1)

Step 1b — run propose-fixes on those qids

QIDS=$(cat /tmp/cloud-fails-to-fix.txt)

# Run multiple invocations until quota or qids exhausted.
# Each invocation does ~18 batches of 20 questions = ~360 qids.
# Cloud needs ~4 invocations to clear all failures.

for i in 1 2 3 4; do
  python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
    --qids "$QIDS" \
    --propose-fixes \
    --workers 8 \
    --max-calls 18 \
    --output interviews/vault/_pipeline/runs/full-corpus-20260503-cloud-backfill
  # The script auto-resumes — already-fixed qids skip on each iteration.
done
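Between iterations, a quick progress check (assumes the backfill run dir gets the same 01_audit.json layout as the other runs):

import json

path = "interviews/vault/_pipeline/runs/full-corpus-20260503-cloud-backfill/01_audit.json"
rows = json.load(open(path))["rows"]
fixed = sum(1 for r in rows if r.get("suggested_corrections"))
print(f"{fixed}/{len(rows)} backfill rows carry suggested_corrections")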

Step 1c — same for edge's 253 missing-fix failures

python3 -c "
import json
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
fails = [
    r['qid'] for r in d['rows']
    if r['qid'].startswith('edge-')
    and not r.get('suggested_corrections')
    and any(r.get(g) == 'fail' for g in ('format_compliance','level_fit','coherence','math_correct'))
]
print(','.join(fails))
" > /tmp/edge-fails-to-fix.txt

QIDS=$(cat /tmp/edge-fails-to-fix.txt)
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
  --qids "$QIDS" \
  --propose-fixes \
  --workers 8 \
  --max-calls 18 \
  --output interviews/vault/_pipeline/runs/full-corpus-20260503-edge-backfill

Step 1d — retry the 20 errored global rows

These will resume automatically on any --tracks global run:

python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
  --tracks global \
  --propose-fixes \
  --workers 8 \
  --max-calls 18 \
  --output interviews/vault/_pipeline/runs/full-corpus-20260503    # main dir, will retry errors

Step 1e — re-merge

python3 interviews/vault-cli/scripts/merge_audit_runs.py \
  --inputs interviews/vault/_pipeline/runs/full-corpus-20260503 \
           interviews/vault/_pipeline/runs/full-corpus-20260503-mobile \
           interviews/vault/_pipeline/runs/full-corpus-20260503-tinyml \
           interviews/vault/_pipeline/runs/full-corpus-20260503-cloud-backfill \
           interviews/vault/_pipeline/runs/full-corpus-20260503-edge-backfill \
  --output interviews/vault/_pipeline/runs/full-corpus-20260503-merged

Step 1f — refresh the findings doc

python3 interviews/vault-cli/scripts/summarize_audit.py \
  --input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
  --output interviews/vault-cli/docs/AUDIT_FINDINGS_2026-05-04.md   # date the new file

Commit the new findings doc; leave the 2026-05-03 one in place as a historical baseline.

Estimated Gemini cost for Step 1: ~80 calls (~30% of one day's quota). Estimated wall time: 1-2 hours of audit runs.


Step 2 — Phase 5: walk corrections interactively

After Step 1 there will be ~3,300 rows with suggested_corrections. Use apply_corrections.py to walk them.

Step 2a — start with the safest auto-acceptable batch

Format-marker-only corrections are mechanical — auto-acceptable:

python3 interviews/vault-cli/scripts/apply_corrections.py \
  --input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
  --filter-gate format_compliance \
  --auto-accept-format

Expect this to clear ~600-800 format-only rows automatically.
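To sanity-check that estimate before running, count the rows whose only failing gate is format_compliance (same row shape as the Step 1 snippets):

import json

GATES = ("format_compliance", "level_fit", "coherence", "math_correct", "title_quality")
path = "interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json"
rows = json.load(open(path))["rows"]
format_only = [
    r["qid"] for r in rows
    if r.get("format_compliance") == "fail"
    and r.get("suggested_corrections")
    and not any(r.get(g) == "fail" for g in GATES if g != "format_compliance")
]
print(f"{len(format_only)} format-only rows with fixes")  # expect ~600-800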

Step 2b — math errors (highest priority, manual review)

python3 interviews/vault-cli/scripts/apply_corrections.py \
  --input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
  --filter-gate math_correct \
  --limit 50    # cap each session to ~50 to avoid review fatigue

Per CORPUS_HARDENING_PLAN.md §10 Q2, when math is wrong, accept napkin_math AND realistic_solution as a unit. Reject if Gemini's proposed fix changes meaning.

Step 2c — coherence failures by failure mode

Coherence failures break down by mode (see the triage snippet after the command below):

  • Vendor fabrication (1 instance: cloud-0560): likely needs a scenario rewrite.
  • Physical absurdity (~70 instances): usually a number adjustment.
  • Scenario/solution mismatch (~80 instances): review case-by-case.

python3 interviews/vault-cli/scripts/apply_corrections.py \
  --input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
  --filter-gate coherence \
  --limit 50
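To triage a session by failure mode first, dump the coherence-fail qids alongside the judge's rationale. The notes field name below is a guess, not a confirmed part of the row schema; substitute whatever key the audit rows actually use:

import json

path = "interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json"
rows = json.load(open(path))["rows"]
for r in rows:
    if r.get("coherence") == "fail":
        print(r["qid"], "|", r.get("notes", "<no notes field>"))  # "notes" is hypothetical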

Step 2d — level-fit (relabel down per CORPUS_HARDENING_PLAN.md §10 Q3)

python3 interviews/vault-cli/scripts/apply_corrections.py \
  --input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
  --filter-gate level_fit \
  --limit 100

The default disposition is to relabel the question to the actual level. Reject if you want to rewrite the question UP (separate authoring task, not Phase 5).

Step 2e — placeholder titles (Phase 7)

python3 interviews/vault-cli/scripts/apply_corrections.py \
  --input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
  --filter-gate title_quality

Validation after each apply session

vault check --strict     # must stay green
pytest interviews/vault-cli/tests/ -q
git status               # review the YAML diffs before committing

Commit each disposition session as one logical commit:

  • fix(vault): format markers — N questions auto-accepted from Phase 5 review
  • fix(vault): math errors — N questions reviewed (with notes for any non-trivial)
  • etc.

Step 3 — Phase 6: schema tightening + lift format gate

Once the corpus is clean (format-failure rate ≈ 0), make the cleanliness load-bearing.

files to edit:
  interviews/vault/schema/question_schema.yaml:
    - Add pattern constraint to Details.common_mistake
      pattern: '(?s).*\*\*The Pitfall:\*\*.*\*\*The Rationale:\*\*.*\*\*The Consequence:\*\*.*'
    - Add pattern constraint to Details.napkin_math
      pattern: '(?s).*\*\*Assumptions.*\*\*Calculations:\*\*.*\*\*Conclusion.*'
    - Make provenance required: true (no default fallback at YAML load)

  interviews/vault-cli/src/vault_cli/models.py:
    - Flip Details `model_config = ConfigDict(extra="allow")` → `extra="forbid"`
      (Pre-checked: 0 unknown extra keys across 9,446 published YAMLs)
    - Add explicit attributes for any audit-stamp fields if needed

  interviews/vault-cli/src/vault_cli/validator.py:
    - Add _format_compliance() to structural_tier (lift gate_format from
      validate_drafts.py into a published-corpus invariant; sketch below)

run:
  vault codegen          # regenerate Pydantic / SQL DDL / TS types
  pytest                 # add tests covering the new invariant
  vault check --strict   # 0 failures
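A minimal sketch of the lifted gate, assuming Details exposes common_mistake and napkin_math as optional strings (names beyond those listed above are illustrative, not the real validator API):

import re

# Mirrors the pattern constraints proposed for question_schema.yaml above.
_PITFALL = re.compile(
    r"(?s).*\*\*The Pitfall:\*\*.*\*\*The Rationale:\*\*.*\*\*The Consequence:\*\*.*"
)
_NAPKIN = re.compile(r"(?s).*\*\*Assumptions.*\*\*Calculations:\*\*.*\*\*Conclusion.*")

def _format_compliance(question) -> list[str]:
    """Return one message per missing format marker; empty list means pass."""
    failures = []
    d = question.details
    if d.common_mistake and not _PITFALL.fullmatch(d.common_mistake):
        failures.append(f"{question.qid}: common_mistake missing Pitfall/Rationale/Consequence markers")
    if d.napkin_math and not _NAPKIN.fullmatch(d.napkin_math):
        failures.append(f"{question.qid}: napkin_math missing Assumptions/Calculations/Conclusion markers")
    return failures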

Also fix 6 cloud questions with stray top-level options/correct_index (note: per the 2026-05-03 update at the top of this doc, this migration is already done):

  • cloud-0048, cloud-0273, cloud-0291, cloud-0336, cloud-0418, cloud-0454
  • Move those fields from top-level into details:

Step 4 — Phase 7: any remaining title-quality fixes

After Phase 5's title-quality session, re-audit just the questions that were placeholders to verify the fixes landed:

python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
  --qids <the-original-placeholder-qids> \
  --propose-fixes \
  --output interviews/vault/_pipeline/runs/full-corpus-20260503-titles-verify

Expect title_quality to pass for all of them now.
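A quick pass/fail check over the verify run's output (same row shape as the earlier snippets):

import json

path = "interviews/vault/_pipeline/runs/full-corpus-20260503-titles-verify/01_audit.json"
rows = json.load(open(path))["rows"]
still_failing = [r["qid"] for r in rows if r.get("title_quality") == "fail"]
print(f"{len(still_failing)} placeholders still failing:", still_failing[:10])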


Step 5 — Phase 8 second half: vault audit CLI subcommand

The cron workflow is already shipped; the CLI integration wasn't when this was written. (Note: per the 2026-05-03 update at the top of this doc, this step is now done — vault audit run / review / summarize / merge shipped.)

add new file:
  interviews/vault-cli/src/vault_cli/commands/audit.py
    - vault audit run [--all|--tracks|--qids] [--propose-fixes] ...
      → wraps audit_corpus_batched.py
    - vault audit review <run-dir> [--filter-gate ...]
      → wraps apply_corrections.py
    - vault audit summarize <run-dir> [--output ...]
      → wraps summarize_audit.py

register in:
  interviews/vault-cli/src/vault_cli/main.py
    add: from vault_cli.commands import audit; audit.register(app)
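A minimal shape for the new module, assuming the CLI is built on Typer (an assumption; mirror whatever framework main.py actually uses). The shipped version also grew a merge subcommand:

import subprocess
import typer

SCRIPTS = "interviews/vault-cli/scripts"
app = typer.Typer(help="Corpus audit workflows; thin wrappers over the audit scripts.")

@app.command()
def run(tracks: str = "", qids: str = "", propose_fixes: bool = False) -> None:
    """Wrap audit_corpus_batched.py."""
    cmd = ["python3", f"{SCRIPTS}/audit_corpus_batched.py"]
    if tracks:
        cmd += ["--tracks", tracks]
    if qids:
        cmd += ["--qids", qids]
    if propose_fixes:
        cmd.append("--propose-fixes")
    subprocess.run(cmd, check=True)

@app.command()
def summarize(run_dir: str, output: str = "") -> None:
    """Wrap summarize_audit.py."""
    cmd = ["python3", f"{SCRIPTS}/summarize_audit.py",
           "--input", f"{run_dir}/01_audit.json"]
    if output:
        cmd += ["--output", output]
    subprocess.run(cmd, check=True)

def register(parent: typer.Typer) -> None:
    parent.add_typer(app, name="audit")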

Step 6 — Phase 9: paper update + release

# Update paper.tex with post-audit corpus stats:
#   - 9,446 published, audit pass rates per gate, per-track tables
#   - Methodology paragraph naming gemini-3.1-pro-preview as audit model
#   - Citation of audit_corpus_batched.py + AUDIT_FINDINGS_<date>.md

vault export-paper
vault build --local-json    # release_hash should roll
vault publish 1.0.0
vault verify 1.0.0 --git-ref v1.0.0   # citation-grade round-trip

git tag vault-1.0.0

Tooling reference

All scripts live under interviews/vault-cli/scripts/:

Script                        What it does
audit_corpus_batched.py       Batched corpus audit (--workers N, --propose-fixes)
apply_corrections.py          Interactive accept/reject for proposed corrections
summarize_audit.py            Generate AUDIT_FINDINGS markdown from 01_audit.json
merge_audit_runs.py           Combine multiple per-track output dirs
backfill_provenance.py        Phase 1 helper (already run)
_judges.py                    Shared prompts + Gemini-call wrapper
_batching.py                  Generic char-budgeted batcher
validate_drafts.py            Single-draft multi-gate (per-question)
audit_math.py                 Single-question math spot-check
audit_chains_with_gemini.py   Chain-level audit (existing, separate concern)

Preserved-for-adaptation in interviews/vault/scripts/ (see DEPRECATED.md):

  • gemini_backfill_question.py, gpt_backfill_question.py
  • gemini_cli_generate_questions.py, generate.py
  • gemini_fix_errors.py, deep_verify.py

Open questions that may still need answers

From CORPUS_HARDENING_PLAN.md §10:

  1. extra="forbid" on Question (recommended: keep lenient on Question, strict on Details — needed for Phase 6)
  2. Cron cadence (recommended: monthly — already implemented)
  3. Per-track audit floor (recommended: introduce post-Phase-5 cleanup)
  4. audit_math.py deprecation timing (recommended: keep one quarter)
  5. AUTHORING.md maintenance hook (recommended: pre-commit field-name check)
  6. Sample size for cron (recommended: full monthly — already configured)

Q2 + Q3 already answered (math errors as a unit; level relabel down).


Commit log highlights from this session

d2621cc9e  feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc
2d9330da6  fix(vault-cli): isolate gemini CLI scratch files in temp dir
e7a2a27bf  feat(ci): staffml-audit-corpus-monthly.yml — recurring corpus audit workflow
3eaac3ca9  feat(vault-cli): summarize_audit.py — Phase 4 finalization helper
1722133fa  feat(vault-cli): apply_corrections.py — interactive accept/reject
1b58a9c50  feat(vault-cli): parallel audit_corpus_batched.py with submit-stagger
12032f700  fix(vault-cli): audit_corpus_batched.py reliability fixes from canary
03031dc38  test(vault-cli): smoke tests for audit_corpus_batched batching
69cf6f0a5  feat(vault-cli): audit_corpus_batched.py — full-corpus batched audit
dd71c66ca  feat(vault-cli): _judges.py + _batching.py — shared infra
f691d6c14  feat(vault-cli): vault new scaffolds full Pitfall/Rationale/Consequence stubs
7500b9281  docs(vault): AUTHORING.md — single-source authoring reference
e8f0faa83  chore(vault): explicit provenance: imported on 407 published questions
39d567f26  feat(vault-cli): backfill_provenance.py — Phase 1 helper
3f0773706  chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation
56d3ed155  chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0
36f2ef592  docs(vault-cli): CORPUS_HARDENING_PLAN.md — supersedes RELEASE_AUDIT_PLAN.md

Estimated remaining time to ship 1.0.0

Step 1 (Phase 4 backfill):        ~2h Gemini  (~80 calls, ~30% of today's quota)
Step 2 (Phase 5 review):          ~6h human   (the big sink)
Step 3 (Phase 6 schema):          ~2h
Step 4 (Phase 7 verify):          ~30 min
Step 5 (Phase 8 CLI subcommand):  ~30 min
Step 6 (Phase 9 release):         ~1h
─────────────────────────────────────────────
Total                             ~12h, mostly Phase 5 review

Spread over 2-3 working days.


Troubleshooting

Q: Gemini calls fail with rate-limit errors. A: Drop --workers to 4 and re-run. The script's resume picks up where it left off.

Q: vault check --strict fails after applying corrections. A: A correction's edited YAML failed Pydantic validation. The script logs the qid; investigate the diff. apply_corrections.py validates BEFORE writing, so this should only happen if the corpus already had a stale validation issue.

Q: The pre-commit codespell hook rejects a findings doc. A: summarize_audit.py:truncate_words already guards against mid-word truncations. If new typos slip in, lengthen the truncation budget or add a codespell baseline allowlist entry.

Q: Worktree shows untracked gemini scratch files. A: My fix in 2d9330da6 isolates new gemini CLI scratch files in a temp dir. Old scratch from before that commit lingers; it is safe to delete with /bin/rm (use the full path, since bare rm is aliased to trash).