cs249r_book/interviews/vault-cli/docs/PHASE_4_HANDOFF.md
Vijay Janapa Reddi 9ee3c34303 docs(vault-cli): PHASE_4_HANDOFF — update post-backfill
Append 2026-05-03 update reflecting:
  - Phase 4 backfill complete (2,757 corrections proposed; 0 errors)
  - 6 cloud questions migrated (stray top-level MCQ fields → details)
  - Phase 8 CLI subcommand shipped (vault audit run/review/summarize/merge)

Next session can skip Step 1 (backfill — done) and start at
Step 2 (Phase 5 interactive review).
2026-05-03 18:39:37 -04:00

# Phase 4 audit handoff — resume guide for the next session
**Status as of 2026-05-03 (updated):** Phases 0-4 + Phase 4 backfill +
Phase 8 (CLI + cron) complete. **Ready for Phase 5 (interactive
review).**
**Branch:** `yaml-audit` (97 commits ahead of origin/dev, 0 behind — merge into
local dev when ready)
**Worktree:** `/Users/VJ/GitHub/MLSysBook-yaml-audit`
**Active workplan:** `interviews/vault-cli/docs/CORPUS_HARDENING_PLAN.md`
## Update appended 2026-05-03
The handoff doc below was written before the Phase 4 backfill ran. Key
deltas to know about:
- **Phase 4 backfill is done.** Cloud + edge failures that were
audit-only got `--propose-fixes` passes. The merged dataset now has
**2,757 questions with suggested_corrections** (up from 1,767),
spanning all 5 tracks. **0 error rows** (all retried).
- **6 cloud questions migrated.** `cloud-{0048,0273,0291,0336,0418,0454}`
had stray top-level `options`/`correct_index` (MCQ data) — moved
into `details:` per the schema. Phase 6's `Details extra='forbid'`
flip is now safe with no further corpus migrations.
- **Phase 8 CLI subcommand shipped.** `vault audit run / review /
summarize / merge` wraps the underlying scripts. Cron workflow
was already in place.
So the new session can **skip Step 1 (backfill)** in the doc below
and go straight to **Step 2 (Phase 5 interactive review)**.
The merged audit dataset for Phase 5 is at:
`interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json`
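For a quick sanity check on that dataset before starting Phase 5, correction coverage per track can be counted directly; a minimal sketch, assuming each row carries `qid` plus an optional `suggested_corrections` list (the same shape Step 1a's filter relies on):

```python
from collections import Counter

def correction_coverage(rows):
    """Rows carrying suggested_corrections, counted by the qid's track prefix."""
    return Counter(
        r["qid"].split("-")[0] for r in rows if r.get("suggested_corrections")
    )

# Real usage: rows = json.load(open(
#   "interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json"))["rows"]
# Tiny in-memory sample for illustration:
sample = [
    {"qid": "cloud-0048", "suggested_corrections": [{"field": "details"}]},
    {"qid": "cloud-0049"},
    {"qid": "edge-0001", "suggested_corrections": [{"field": "napkin_math"}]},
]
print(dict(correction_coverage(sample)))  # {'cloud': 1, 'edge': 1}
```

The total across all tracks should come out at 2,757.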
---
## What's done
```diff
+ Phase 0 Cleanup deprecated scripts + dead-end audit_corpus.py ✅
+ Phase 1 Backfilled provenance: imported on 407 published YAMLs ✅
+ Phase 2 AUTHORING.md + vault new scaffold with format-marker stubs ✅
+ Phase 3 Built audit_corpus_batched.py + _judges.py + _batching.py ✅
+ Phase 4 Full-corpus audit run ✅ (9,446 / 9,446)
```
**Phase 4 outputs (all in `interviews/vault/_pipeline/runs/`, gitignored):**
```
full-corpus-20260503/          main run dir (cloud audit-only + edge mixed + global + 140 mobile)
full-corpus-20260503-mobile/   parallel mobile run (1,824 rows with fixes)
full-corpus-20260503-tinyml/   parallel tinyml run (1,202 rows with fixes)
full-corpus-20260503-merged/   ✦ canonical merged dataset — start here
```
The merged dataset has all 9,446 questions with 1,767 already carrying
`suggested_corrections` from the `--propose-fixes` invocations.
**Phase 4 findings doc (committed):**
`interviews/vault-cli/docs/AUDIT_FINDINGS_2026-05-03.md`
---
## Final corpus state (Phase 4 results)
| gate | fail | rate |
|---|---:|---:|
| format_compliance | ~960 | 10.2% |
| level_fit | ~1,580 | 16.7% |
| coherence | ~480 | 5.1% |
| math_correct | ~330 | 3.5% |
| title_quality | ~250 placeholder + ~25 malformed | 2.9% |
| errors (need retry) | 20 (all global) | 0.2% |
Per-track failure rates: tinyml has the highest level-inflation rate
(21.4%); cloud has the most math errors in absolute terms; edge has a
higher coherence-fail rate (7%) than the other tracks. See AUDIT_FINDINGS for
details and qid lists.
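The table above can be recomputed from the merged JSON at any point; a sketch, assuming gate verdicts are stored as `'fail'` strings on each row (the same shape the Step 1 snippets assume):

```python
GATES = ("format_compliance", "level_fit", "coherence", "math_correct")

def gate_fail_rates(rows, gates=GATES):
    """Fraction of rows failing each gate, as {gate: rate}."""
    n = len(rows) or 1  # avoid dividing by zero on an empty run
    return {g: sum(1 for r in rows if r.get(g) == "fail") / n for g in gates}

# Illustrative sample; real rows come from 01_audit.json.
sample = [
    {"qid": "tinyml-0001", "level_fit": "fail", "coherence": "pass"},
    {"qid": "tinyml-0002", "level_fit": "pass", "coherence": "pass"},
]
print(gate_fail_rates(sample))
# {'format_compliance': 0.0, 'level_fit': 0.5, 'coherence': 0.0, 'math_correct': 0.0}
```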
---
## What's left
```yaml
remaining:
  Phase 4 finalization:
    - retry 20 errored rows in global (1 invocation, ~1 batch)
    - backfill --propose-fixes on cloud's ~1,344 failures (no fixes today)
    - backfill --propose-fixes on edge's ~253 failures missing fixes
    - re-merge into full-corpus-20260503-merged/
  Phase 5: Walk ~3,300 corrections interactively (~6h human review)
  Phase 6: Schema tightening + lift format gate (~2h)
  Phase 7: Title-quality pass on ~250 placeholders (~30 calls + review)
  Phase 8: Add `vault audit` CLI subcommand (~30 min)
  Phase 9: Update paper.tech, vault publish 1.0.0, tag (~1h)
```
Cron workflow (`staffml-audit-corpus-monthly.yml`) is already shipped.
---
## How to resume
### Step 0 — sanity check the worktree
```bash
cd /Users/VJ/GitHub/MLSysBook-yaml-audit
git status # should be clean
git log --oneline -10 # confirm Phase 0-4 commits present
git branch # * yaml-audit
vault check --strict # 10,711 loaded, 0 invariant failures
pytest interviews/vault-cli/tests/ -q # 84 passed
ruff check interviews/vault-cli # clean
```
### Step 1 — finish Phase 4 backfill (~5 invocations of Gemini)
The cloud track was audited in audit-only mode (no `--propose-fixes`)
during invocations 1-7 (yesterday). To get `suggested_corrections` for
cloud's failures, run a propose-fixes pass against the failure-only subset.
#### Step 1a — list cloud failure qids
```bash
python3 -c "
import json
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
fails = [
    r['qid'] for r in d['rows']
    if r['qid'].startswith('cloud-')
    and not r.get('suggested_corrections')
    and any(r.get(g) == 'fail' for g in ('format_compliance','level_fit','coherence','math_correct'))
]
print(','.join(fails))
" > /tmp/cloud-fails-to-fix.txt
tr ',' '\n' < /tmp/cloud-fails-to-fix.txt | wc -l  # ~1,344 qids (the list is comma-joined, so wc -w would report 1)
```
#### Step 1b — run propose-fixes on those qids
```bash
QIDS=$(cat /tmp/cloud-fails-to-fix.txt)
# Run multiple invocations until quota or qids exhausted.
# Each invocation does ~18 batches of 20 questions = ~360 qids.
# Cloud needs ~4 invocations to clear all failures.
for i in 1 2 3 4; do
  python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
    --qids "$QIDS" \
    --propose-fixes \
    --workers 8 \
    --max-calls 18 \
    --output interviews/vault/_pipeline/runs/full-corpus-20260503-cloud-backfill
  # The script auto-resumes — already-fixed qids skip on each iteration.
done
```
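For intuition, the auto-resume amounts to skipping qids that already have fixed rows in the output dir. An illustrative sketch only — not the script's actual code — assuming the run dir accumulates a `01_audit.json` shaped like the merged file:

```python
import json
import pathlib

def remaining_qids(requested, run_dir):
    """Qids still needing a fix pass: requested minus already-fixed rows in run_dir."""
    out = pathlib.Path(run_dir) / "01_audit.json"
    done = set()
    if out.exists():
        done = {
            r["qid"]
            for r in json.loads(out.read_text())["rows"]
            if r.get("suggested_corrections")
        }
    return [q for q in requested if q not in done]

# Fresh run dir: nothing done yet, so everything remains.
print(remaining_qids(["cloud-0001", "cloud-0002"], "/nonexistent-run-dir"))
# ['cloud-0001', 'cloud-0002']
```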
#### Step 1c — same for edge's 253 missing-fix failures
```bash
python3 -c "
import json
d = json.loads(open('interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json').read())
fails = [
    r['qid'] for r in d['rows']
    if r['qid'].startswith('edge-')
    and not r.get('suggested_corrections')
    and any(r.get(g) == 'fail' for g in ('format_compliance','level_fit','coherence','math_correct'))
]
print(','.join(fails))
" > /tmp/edge-fails-to-fix.txt
QIDS=$(cat /tmp/edge-fails-to-fix.txt)
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
--qids "$QIDS" \
--propose-fixes \
--workers 8 \
--max-calls 18 \
--output interviews/vault/_pipeline/runs/full-corpus-20260503-edge-backfill
```
#### Step 1d — retry the 20 errored global rows
These will resume automatically on any `--tracks global` run:
```bash
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
--tracks global \
--propose-fixes \
--workers 8 \
--max-calls 18 \
--output interviews/vault/_pipeline/runs/full-corpus-20260503 # main dir, will retry errors
```
#### Step 1e — re-merge
```bash
python3 interviews/vault-cli/scripts/merge_audit_runs.py \
--inputs interviews/vault/_pipeline/runs/full-corpus-20260503 \
interviews/vault/_pipeline/runs/full-corpus-20260503-mobile \
interviews/vault/_pipeline/runs/full-corpus-20260503-tinyml \
interviews/vault/_pipeline/runs/full-corpus-20260503-cloud-backfill \
interviews/vault/_pipeline/runs/full-corpus-20260503-edge-backfill \
--output interviews/vault/_pipeline/runs/full-corpus-20260503-merged
```
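Worth knowing before re-running: the merge must not let an audit-only row clobber a row that already carries fixes. A hypothetical sketch of that per-qid rule (the real `merge_audit_runs.py` may differ):

```python
def merge_rows(runs):
    """Merge lists of audit rows by qid; a later row replaces an earlier one
    unless it would drop an existing suggested_corrections payload."""
    merged = {}
    for rows in runs:
        for r in rows:
            prev = merged.get(r["qid"])
            if prev and prev.get("suggested_corrections") and not r.get("suggested_corrections"):
                continue  # keep the row that already carries fixes
            merged[r["qid"]] = r
    return list(merged.values())

# Illustration: a backfill row upgrades an audit-only row for the same qid.
audit_only = [{"qid": "cloud-0001"}]
backfill = [{"qid": "cloud-0001", "suggested_corrections": [{"field": "title"}]}]
print(len(merge_rows([audit_only, backfill])))  # 1
```

Either input order should leave the fix-carrying row in place.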
#### Step 1f — refresh the findings doc
```bash
python3 interviews/vault-cli/scripts/summarize_audit.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--output interviews/vault-cli/docs/AUDIT_FINDINGS_2026-05-04.md # date the new file
```
Commit the new findings doc; leave the 2026-05-03 one in place as a
historical baseline.
**Estimated Gemini cost for Step 1: ~80 calls (~30% of one day's quota).**
**Estimated wall time: 1-2 hours of audit runs.**
---
### Step 2 — Phase 5: walk corrections interactively
After Step 1 there will be ~3,300 rows with `suggested_corrections`.
Use `apply_corrections.py` to walk them.
#### Step 2a — start with the safest auto-acceptable batch
Format-marker-only corrections are mechanical — auto-acceptable:
```bash
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate format_compliance \
--auto-accept-format
```
Expect this to clear ~600-800 format-only rows automatically.
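Whether a correction is "format-marker-only" can be checked mechanically: if stripping the bold section markers from both sides leaves identical text, nothing semantic changed. A sketch with assumed `before`/`after` fields on the correction (the actual correction schema may differ):

```python
FORMAT_MARKERS = (
    "**The Pitfall:**", "**The Rationale:**", "**The Consequence:**",
    "**Assumptions", "**Calculations:**", "**Conclusion",
)

def is_format_only(correction):
    """True if the fix only inserts/normalizes the bold section markers:
    stripping markers and whitespace from both sides leaves identical text."""
    def strip_markers(s):
        for m in FORMAT_MARKERS:
            s = s.replace(m, "")
        return "".join(s.split())  # ignore whitespace reflow too
    return strip_markers(correction["before"]) == strip_markers(correction["after"])

fix = {"before": "The cache is cold.",
       "after": "**The Pitfall:** The cache is cold."}
print(is_format_only(fix))  # True
```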
#### Step 2b — math errors (highest priority, manual review)
```bash
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate math_correct \
--limit 50 # cap each session to ~50 to avoid review fatigue
```
Per CORPUS_HARDENING_PLAN.md §10 Q2, when math is wrong, accept `napkin_math` AND `realistic_solution` as a unit. Reject if Gemini's proposed fix changes meaning.
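The accept-as-a-unit rule can also be enforced mechanically; a sketch with field names (`napkin_math`, `realistic_solution`) taken from the schema references elsewhere in this doc, and a made-up qid and values:

```python
def accept_math_fix(question, correction):
    """Apply a math correction only if it covers napkin_math and
    realistic_solution together (CORPUS_HARDENING_PLAN.md §10 Q2)."""
    unit = {"napkin_math", "realistic_solution"}
    touched = set(correction) & unit
    if touched and touched != unit:
        raise ValueError(f"partial math fix touches only {touched}; accept as a unit")
    return {**question, **correction}

# Hypothetical question and fix, for illustration only:
q = {"qid": "cloud-9999", "napkin_math": "2+2=5", "realistic_solution": "so 5"}
fixed = accept_math_fix(q, {"napkin_math": "2+2=4", "realistic_solution": "so 4"})
print(fixed["napkin_math"])  # 2+2=4
```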
#### Step 2c — coherence failures by failure mode
- Vendor fabrication (1 instance: `cloud-0560`): likely a scenario rewrite.
- Physical absurdity (~70 instances): usually a number adjustment.
- Scenario/solution mismatch (~80 instances): review case-by-case.
```bash
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate coherence \
--limit 50
```
#### Step 2d — level-fit (relabel down per CORPUS_HARDENING_PLAN.md §10 Q3)
```bash
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate level_fit \
--limit 100
```
The default disposition is to relabel the question to the actual level.
Reject if you want to rewrite the question UP (separate authoring task,
not Phase 5).
#### Step 2e — placeholder titles (Phase 7)
```bash
python3 interviews/vault-cli/scripts/apply_corrections.py \
--input interviews/vault/_pipeline/runs/full-corpus-20260503-merged/01_audit.json \
--filter-gate title_quality
```
#### Validation after each apply session
```bash
vault check --strict # must stay green
pytest interviews/vault-cli/tests/ -q
git status # review the YAML diffs before committing
```
Commit each disposition session as one logical commit:
- `fix(vault): format markers — N questions auto-accepted from Phase 5 review`
- `fix(vault): math errors — N questions reviewed (with notes for any non-trivial)`
- etc.
---
### Step 3 — Phase 6: schema tightening + lift format gate
Once the corpus is clean (format-failure rate ≈ 0), make the cleanliness load-bearing.
```yaml
files to edit:
  interviews/vault/schema/question_schema.yaml:
    - Add pattern constraint to Details.common_mistake:
        pattern: '(?s).*\*\*The Pitfall:\*\*.*\*\*The Rationale:\*\*.*\*\*The Consequence:\*\*.*'
    - Add pattern constraint to Details.napkin_math:
        pattern: '(?s).*\*\*Assumptions.*\*\*Calculations:\*\*.*\*\*Conclusion.*'
    - Make provenance required: true (no default fallback at YAML load)
  interviews/vault-cli/src/vault_cli/models.py:
    - Flip Details `model_config = ConfigDict(extra="allow")` → `extra="forbid"`
      (Pre-checked: 0 unknown extra keys across 9,446 published YAMLs)
    - Add explicit attributes for any audit-stamp fields if needed
  interviews/vault-cli/src/vault_cli/validator.py:
    - Add _format_compliance() to structural_tier (lift gate_format from
      validate_drafts.py into a published-corpus invariant)
run:
  vault codegen        # regenerate Pydantic / SQL DDL / TS types
  pytest               # add tests covering the new invariant
  vault check --strict # 0 failures
```
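The two pattern constraints above can be smoke-tested before editing the schema; the regexes below are copied verbatim from the plan, the stub text is made up:

```python
import re

PITFALL_PAT = r"(?s).*\*\*The Pitfall:\*\*.*\*\*The Rationale:\*\*.*\*\*The Consequence:\*\*.*"
NAPKIN_PAT = r"(?s).*\*\*Assumptions.*\*\*Calculations:\*\*.*\*\*Conclusion.*"

stub = (
    "**The Pitfall:** Treating p99 like p50.\n"
    "**The Rationale:** Tail latency dominates SLOs.\n"
    "**The Consequence:** Missed error budgets.\n"
)
print(bool(re.fullmatch(PITFALL_PAT, stub)))  # True
print(bool(re.fullmatch(NAPKIN_PAT, stub)))   # False: no napkin-math sections
```

`(?s)` makes `.` span newlines, so the markers only need to appear in order anywhere in the field.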
**Also fix 6 cloud questions with stray top-level `options`/`correct_index`:**
- `cloud-0048`, `cloud-0273`, `cloud-0291`, `cloud-0336`, `cloud-0418`, `cloud-0454`
- Move those fields from top-level into `details:`
---
### Step 4 — Phase 7: any remaining title-quality fixes
After Phase 5's title-quality session, re-audit just the questions that
were placeholders to verify the fixes landed:
```bash
python3 interviews/vault-cli/scripts/audit_corpus_batched.py \
--qids <the-original-placeholder-qids> \
--propose-fixes \
--output interviews/vault/_pipeline/runs/full-corpus-20260503-titles-verify
```
Expect title_quality to be `good` for all of them now.
---
### Step 5 — Phase 8 second half: `vault audit` CLI subcommand
The cron workflow is already shipped. The CLI integration isn't yet.
```yaml
add new file:
  interviews/vault-cli/src/vault_cli/commands/audit.py:
    - vault audit run [--all|--tracks|--qids] [--propose-fixes] ...
      → wraps audit_corpus_batched.py
    - vault audit review <run-dir> [--filter-gate ...]
      → wraps apply_corrections.py
    - vault audit summarize <run-dir> [--output ...]
      → wraps summarize_audit.py
register in:
  interviews/vault-cli/src/vault_cli/main.py:
    - add: from vault_cli.commands import audit; audit.register(app)
```
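One possible shape for `audit.py` plus `register(app)`, sketched with stdlib argparse for illustration (the real CLI may use a different framework; the flag set below is assumed from the commands in this doc):

```python
import argparse
import sys

# Hypothetical script mapping, per the tooling reference table below.
SCRIPTS = {
    "run": "interviews/vault-cli/scripts/audit_corpus_batched.py",
    "review": "interviews/vault-cli/scripts/apply_corrections.py",
    "summarize": "interviews/vault-cli/scripts/summarize_audit.py",
    "merge": "interviews/vault-cli/scripts/merge_audit_runs.py",
}

def register(parser: argparse.ArgumentParser) -> None:
    """Attach audit run/review/summarize/merge subcommands to a parser."""
    sub = parser.add_subparsers(dest="audit_cmd", required=True)
    run = sub.add_parser("run", help=f"wraps {SCRIPTS['run']}")
    run.add_argument("--tracks")
    run.add_argument("--qids")
    run.add_argument("--propose-fixes", action="store_true")
    review = sub.add_parser("review", help=f"wraps {SCRIPTS['review']}")
    review.add_argument("run_dir")
    review.add_argument("--filter-gate")
    for name in ("summarize", "merge"):
        sub.add_parser(name, help=f"wraps {SCRIPTS[name]}")

def to_argv(ns: argparse.Namespace) -> list:
    """Translate parsed args back into the wrapped script's command line."""
    argv = [sys.executable, SCRIPTS[ns.audit_cmd]]
    if ns.audit_cmd == "run":
        if ns.tracks:
            argv += ["--tracks", ns.tracks]
        if ns.qids:
            argv += ["--qids", ns.qids]
        if ns.propose_fixes:
            argv += ["--propose-fixes"]
    return argv

parser = argparse.ArgumentParser(prog="vault audit")
register(parser)
ns = parser.parse_args(["run", "--tracks", "global", "--propose-fixes"])
print(to_argv(ns)[2:])  # ['--tracks', 'global', '--propose-fixes']
```

A real implementation would hand `to_argv(ns)` to `subprocess.run` rather than printing it.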
---
### Step 6 — Phase 9: paper update + release
```bash
# Update paper.tech with post-audit corpus stats:
# - 9,446 published, audit pass rates per gate, per-track tables
# - Methodology paragraph naming gemini-3.1-pro-preview as audit model
# - Citation of audit_corpus_batched.py + AUDIT_FINDINGS_<date>.md
vault export-paper
vault build --local-json # release_hash should roll
vault publish 1.0.0
vault verify 1.0.0 --git-ref v1.0.0 # citation-grade round-trip
git tag vault-1.0.0
```
---
## Tooling reference
All scripts live under `interviews/vault-cli/scripts/`:
| Script | What it does |
|---|---|
| `audit_corpus_batched.py` | Batched corpus audit. `--workers N --propose-fixes` |
| `apply_corrections.py` | Interactive accept/reject for proposed corrections |
| `summarize_audit.py` | Generate AUDIT_FINDINGS markdown from 01_audit.json |
| `merge_audit_runs.py` | Combine multiple per-track output dirs |
| `backfill_provenance.py` | Phase 1 helper (already run) |
| `_judges.py` | Shared prompts + Gemini-call wrapper |
| `_batching.py` | Generic char-budgeted batcher |
| `validate_drafts.py` | Single-draft multi-gate (per-question) |
| `audit_math.py` | Single-question math spot-check |
| `audit_chains_with_gemini.py` | Chain-level audit (existing, separate concern) |
Preserved-for-adaptation in `interviews/vault/scripts/` (see DEPRECATED.md):
- `gemini_backfill_question.py`, `gpt_backfill_question.py`
- `gemini_cli_generate_questions.py`, `generate.py`
- `gemini_fix_errors.py`, `deep_verify.py`
---
## Open questions that may still need answers
From `CORPUS_HARDENING_PLAN.md §10`:
1. `extra="forbid"` on `Question` (recommended: keep lenient on Question, strict on Details — needed for Phase 6)
4. Cron cadence (recommended: monthly — already implemented)
5. Per-track audit floor (recommended: introduce post-Phase-5 cleanup)
6. `audit_math.py` deprecation timing (recommended: keep one quarter)
7. AUTHORING.md maintenance hook (recommended: pre-commit field-name check)
8. Sample size for cron (recommended: full monthly — already configured)
Q2 + Q3 already answered (math errors as a unit; level relabel down).
---
## Commit log highlights from this session
```
d2621cc9e feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc
2d9330da6 fix(vault-cli): isolate gemini CLI scratch files in temp dir
e7a2a27bf feat(ci): staffml-audit-corpus-monthly.yml — recurring corpus audit workflow
3eaac3ca9 feat(vault-cli): summarize_audit.py — Phase 4 finalization helper
1722133fa feat(vault-cli): apply_corrections.py — interactive accept/reject
1b58a9c50 feat(vault-cli): parallel audit_corpus_batched.py with submit-stagger
12032f700 fix(vault-cli): audit_corpus_batched.py reliability fixes from canary
03031dc38 test(vault-cli): smoke tests for audit_corpus_batched batching
69cf6f0a5 feat(vault-cli): audit_corpus_batched.py — full-corpus batched audit
dd71c66ca feat(vault-cli): _judges.py + _batching.py — shared infra
f691d6c14 feat(vault-cli): vault new scaffolds full Pitfall/Rationale/Consequence stubs
7500b9281 docs(vault): AUTHORING.md — single-source authoring reference
e8f0faa83 chore(vault): explicit provenance: imported on 407 published questions
39d567f26 feat(vault-cli): backfill_provenance.py — Phase 1 helper
3f0773706 chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation
56d3ed155 chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0
36f2ef592 docs(vault-cli): CORPUS_HARDENING_PLAN.md — supersedes RELEASE_AUDIT_PLAN.md
```
---
## Estimated remaining time to ship 1.0.0
```
Step 1 (Phase 4 backfill): ~2h Gemini (~80 calls, ~30% of today's quota)
Step 2 (Phase 5 review): ~6h human (the big sink)
Step 3 (Phase 6 schema): ~2h
Step 4 (Phase 7 verify): ~30 min
Step 5 (Phase 8 CLI subcommand): ~30 min
Step 6 (Phase 9 release): ~1h
─────────────────────────────────────────────
Total ~12h, mostly Phase 5 review
```
Spread over 2-3 working days.
---
## Troubleshooting
**Q: Gemini calls fail with rate-limit errors.**
A: Drop `--workers` to 4 and re-run. The script's resume picks up where it left off.
**Q: `vault check --strict` fails after applying corrections.**
A: A correction's edited YAML failed Pydantic. The script logs the qid;
investigate the diff. `apply_corrections.py` validates BEFORE writing,
so this only happens if the corpus had a stale validation issue.
**Q: pre-commit hook codespell rejects a finding doc.**
A: `summarize_audit.py:truncate_words` already addresses mid-word
truncations. If new typos slip in, lengthen the truncation budget or
add a baseline allow.
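For reference, word-boundary truncation of the kind `truncate_words` implements; a hypothetical re-implementation, not the script's code:

```python
def truncate_words(text, budget):
    """Truncate to at most `budget` characters without cutting mid-word,
    so codespell never sees a half-word like 'truncat'."""
    if len(text) <= budget:
        return text
    cut = text.rfind(" ", 0, budget + 1)
    return text[:cut] if cut > 0 else text[:budget]

print(truncate_words("tail latency dominates the budget", 14))  # tail latency
```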
**Q: Worktree shows untracked gemini scratch files.**
A: My fix in `2d9330da6` isolates new gemini CLI scratch to a temp dir.
Old scratch from before that commit lingers — safe to `/bin/rm` (do
not use `rm` since it's aliased to `trash`).