feat(vault-cli): Phase 3.a + 3.b — gap-driven authoring tooling

Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.

generate_question_for_gap.py (3.a):
  - Reads a gap entry, loads between-questions + same-bucket exemplars,
    prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
    and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
    drafts out of vault check / vault build until promotion.
  - ID allocator scans corpus + existing drafts so a batch run gets
    distinct fresh IDs without touching id-registry.yaml.
  - Modes: --gap-index, --gaps-from + --limit, --dry-run.

validate_drafts.py (3.b):
  - Five gates per draft: schema (Pydantic), originality (cosine vs
    in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus
    embeddings.npz so values are comparable; cutoff 0.92), level_fit
    (Gemini-judge against same-level exemplars), coherence
    (Gemini-judge: scenario/question/solution consistency), and bridge
    (Gemini-judge: chain-fit between the gap's two anchors).
  - Final verdict: pass iff every non-skipped gate passes.
  - Skips: --no-originality, --no-llm-judge.
  - Output: interviews/vault/draft-validation-scorecard.json.

Smoke checks:
  - 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
    cloud-4579. Synthetic Gemini response Pydantic-validates clean.
  - 3.b on a synthetic /tmp draft: schema + originality pass (top
    neighbour cosine 0.73 vs 0.92 threshold).

Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.

CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vijay Janapa Reddi
2026-05-01 11:31:06 -04:00
parent d94a79942c
commit 84b1fab082
3 changed files with 1085 additions and 2 deletions

interviews/vault-cli/docs/CHAIN_ROADMAP.md

@@ -3,7 +3,7 @@
**Status:** active workstream
**Branch:** `yaml-audit` (off `dev`)
**Worktree:** `/Users/VJ/GitHub/MLSysBook-yaml-audit`
-**Last updated:** 2026-05-01 (Phase 1 + 2 + 4.2/4.8 shipped — 879 chains, tier in UI, docs current)
+**Last updated:** 2026-05-01 (Phase 3.a + 3.b tooling shipped; 3.c pilot deferred for review)
This document is the canonical resumable plan for the vault chain rebuild
+ corpus growth work. **Future Claude sessions: read the "Resume Here"
@@ -368,7 +368,7 @@ primary chains in default surfaces, exposes secondary in "more paths."
## Phase 3 — Gap-driven question authoring
-**Status:** `not started`
+**Status:** `tooling complete (3.a + 3.b); pilot 3.c deferred for review`
**Goal:** Use the 138+ entries in `gaps.proposed.json` to author new
questions filling missing rungs, validated independently before commit.
This is the durable corpus growth strategy.
@@ -966,5 +966,103 @@ available to review the first few generated drafts.
---
### 2026-05-01 — Phase 3.a + 3.b: authoring + validation tooling
**What was done:**
**Phase 3.a — `generate_question_for_gap.py`:**
- Reads a gap entry (`{track, topic, missing_level, between, rationale}`)
from gaps.proposed.json (or .lenient.json), loads the between-questions
in full + up to 3 same-bucket exemplars at the target level, prompts
Gemini-3.1-pro-preview with the schema summary + bridge context, and
writes a candidate question to
`interviews/vault/questions/<track>/<area>/<id>.yaml.draft`.
- ID allocator scans the existing corpus + already-written drafts so a
batch run gets distinct fresh IDs without touching `id-registry.yaml`
(registry append happens at promotion time, not generation).
- Authoring metadata stamped under a private `_authoring` block:
origin model, tool name, timestamp, and the source gap entry. The
Pydantic Question model has `extra="allow"`, so this passes schema.
- Modes: `--gap-index <N>` (single gap), `--gaps-from <path> --limit N`
(batch), `--dry-run` (build prompts without calling Gemini).
- Smoke checks:
- `--dry-run --gap-index 0` resolves the first gap, finds 3 exemplars,
builds the prompt, allocates `cloud-4579`. ✓
- Synthetic Gemini response → `assemble_draft` → `Question.model_validate`
passes; YAML preview looks right (12-field body, sensible details). ✓
**Phase 3.b — `validate_drafts.py`:**
- Five-gate scorecard per draft:
1. **schema** — Pydantic Question (mandatory; downstream gates skip
on schema fail to avoid spurious LLM calls)
2. **originality** — embeds `title + scenario + question` with
`BAAI/bge-small-en-v1.5` (matches the corpus embeddings.npz model
so cosines are directly comparable), compares against in-bucket
neighbors, flags any `cosine ≥ 0.92`
3. **level_fit** — Gemini-judge against ≤5 published exemplars at the
target level in the same (track, topic)
4. **coherence** — Gemini-judge: scenario / question /
realistic_solution mutually consistent
5. **bridge** — Gemini-judge: candidate genuinely chains between the
two `between` questions named in `_authoring.gap`
- Skips: `--no-originality` (skip embed model load),
`--no-llm-judge` (skip Gemini gates). Schema gate is unconditional.
- Output: `interviews/vault/draft-validation-scorecard.json` with per-row
detail + final verdict (`pass | fail | error`).
- Smoke check: synthetic draft in /tmp passed schema + originality
(top-neighbor cosine 0.73 vs 0.92 threshold). End-to-end runner
produced a well-formed scorecard. ✓
**What was deliberately not done tonight:**
- **Phase 3.c (pilot run on 30 highest-value gaps):** This generates
new YAML question content that needs human review *before* promotion.
Running 30 unsupervised generations and 30×3 LLM-judge calls without
the user available to spot-check the first few outputs is the wrong
shape of work for an overnight slot. The tooling is ready when the
user is.
- **Phase 3.d–3.f:** Promotion + re-chain are downstream of 3.c
acceptance.
**Recommended pilot when the user is back:**
1. Pick 30 gaps from `gaps.proposed.lenient.json` where the bucket has
≥4 questions already (just missing the bridge; see the selection sketch
after this list):
```bash
python3 interviews/vault-cli/scripts/generate_question_for_gap.py \
--gaps-from interviews/vault/gaps.proposed.lenient.json \
--limit 30
```
2. Validate:
```bash
python3 interviews/vault-cli/scripts/validate_drafts.py
```
3. Manually review the passing drafts (~20-25 expected).
4. Promote: rename `.yaml.draft` → `.yaml`, append to id-registry.
5. Re-run `build_chains_with_gemini.py --all` so the new questions get
absorbed into chains.
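Note on step 1: the command above just takes the first 30 entries; the "bucket has ≥4 questions" criterion needs a small pre-filter. A minimal sketch, assuming `gaps.proposed.lenient.json` is a JSON array of gap entries with `track`/`topic` fields (as in the generator's docstring); the `/tmp/pilot-gaps.json` output path is illustrative, not a committed artifact:
```python
#!/usr/bin/env python3
"""Sketch: keep the first 30 gaps whose (track, topic) bucket already has
at least 4 committed questions, and write them to a file usable with
generate_question_for_gap.py --gaps-from."""
import json
from collections import Counter
from pathlib import Path

import yaml

VAULT = Path("interviews/vault")

# Count committed questions per (track, topic) bucket.
bucket_sizes = Counter()
for p in (VAULT / "questions").rglob("*.yaml"):
    q = yaml.safe_load(p.read_text(encoding="utf-8"))
    if isinstance(q, dict) and q.get("track") and q.get("topic"):
        bucket_sizes[(q["track"], q["topic"])] += 1

gaps = json.loads((VAULT / "gaps.proposed.lenient.json").read_text(encoding="utf-8"))
pilot = [g for g in gaps
         if bucket_sizes[(g.get("track"), g.get("topic"))] >= 4][:30]

out = Path("/tmp/pilot-gaps.json")  # illustrative location
out.write_text(json.dumps(pilot, indent=2) + "\n", encoding="utf-8")
print(f"{len(pilot)} gaps -> {out}")
```
Then pass `--gaps-from /tmp/pilot-gaps.json` to the generate command in step 1 in place of the full lenient file.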
**Files committed:**
- `interviews/vault-cli/scripts/generate_question_for_gap.py` (new)
- `interviews/vault-cli/scripts/validate_drafts.py` (new)
- `interviews/vault-cli/docs/CHAIN_ROADMAP.md` (this Progress Log entry +
status flips)
**Notes for next session:**
- Both scripts assume `gemini` CLI on PATH (gemini-3.1-pro-preview) and,
for originality, the corpus's `embeddings.npz` (gitignored, regenerable
by the existing embedding scripts). `validate_drafts --no-llm-judge`
is a fast first cut that only exercises schema + originality if you
want to triage drafts before paying for the LLM-judge calls.
- Heads up: each draft in 3.b consumes ~3 Gemini calls (level_fit +
coherence + bridge). 30 drafts → ~90 calls. Daily cap is 250.
- `id-registry.yaml` is append-only and CI-enforced. Promotion (3.d)
needs to add new IDs to it; that's not yet wired into a script —
manual append for the pilot, then we can extract a `vault promote`
helper from the pattern.
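Until that `vault promote` helper exists, the manual step can be as small as the sketch below: it renames the reviewed drafts and prints the IDs that still need the append-only `id-registry.yaml` entry. The registry's entry format isn't assumed here (which is why the append itself stays manual), and the script is a sketch, not committed tooling:
```python
#!/usr/bin/env python3
"""Sketch: promote reviewed drafts passed on the command line by renaming
<id>.yaml.draft -> <id>.yaml, then list the IDs that still need a manual
append to interviews/vault/id-registry.yaml."""
import sys
from pathlib import Path

import yaml

promoted: list[str] = []
for arg in sys.argv[1:]:  # only the drafts a human reviewer has accepted
    draft = Path(arg)
    if not draft.name.endswith(".yaml.draft"):
        print(f"skip (not a .yaml.draft): {draft}", file=sys.stderr)
        continue
    body = yaml.safe_load(draft.read_text(encoding="utf-8"))
    qid = body.get("id") if isinstance(body, dict) else None
    if not qid:
        print(f"skip (no id field): {draft}", file=sys.stderr)
        continue
    draft.rename(draft.with_suffix(""))  # edge-2545.yaml.draft -> edge-2545.yaml
    promoted.append(qid)

print("IDs to append (manually, append-only) to interviews/vault/id-registry.yaml:")
for qid in promoted:
    print(f"  {qid}")
```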
**Next step:** Phase 3.c — pilot run on 30 high-value gaps (best done
with the user available to spot-check the first few outputs).
---
<!-- Append new entries above this comment; reverse chronological order is fine,
but keep entries dated and self-contained for resume context. -->

interviews/vault-cli/scripts/generate_question_for_gap.py

@@ -0,0 +1,493 @@
#!/usr/bin/env python3
"""Author a candidate question to fill a chain gap (Phase 3.a).
Reads a gap entry (from gaps.proposed.json / gaps.proposed.lenient.json)
that names two existing questions and a missing Bloom level between
them, then prompts Gemini-3.1-pro-preview to draft a bridging question
that fits the (track, topic, target-level) slot.
Inputs per gap entry:
{
"track": "edge",
"topic": "memory-mapped-inference",
"missing_level": "L3",
"between": ["edge-0220", "edge-0224"],
"rationale": "..."
}
Outputs per accepted draft:
interviews/vault/questions/<track>/<area>/<auto-id>.yaml.draft
— full question YAML with stamped authoring metadata. The .draft
suffix is intentional: vault check / vault build only load *.yaml,
so drafts ride along in the tree without affecting the release set
until they are promoted (renamed to .yaml) by a follow-up step.
Usage:
python3 generate_question_for_gap.py --gap-index 0
python3 generate_question_for_gap.py --gaps-from interviews/vault/gaps.proposed.json --limit 5
python3 generate_question_for_gap.py --gaps-from <path> --limit 30 --output-dir <dir>
This is the Phase 3.a tool. Validation (originality / level-fit /
coherence / bridge) is a separate concern handled by validate_drafts.py.
The only validation done here is structural Pydantic-schema acceptance,
which is the gate that prevents writing a malformed YAML to disk.
"""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import yaml
REPO_ROOT = Path(__file__).resolve().parents[3]
VAULT_DIR = REPO_ROOT / "interviews" / "vault"
QUESTIONS_DIR = VAULT_DIR / "questions"
ID_REGISTRY = VAULT_DIR / "id-registry.yaml"
DEFAULT_GAPS = VAULT_DIR / "gaps.proposed.json"
GEMINI_MODEL = "gemini-3.1-pro-preview"
INTER_CALL_DELAY_S = 6 # be polite to the Gemini CLI's rate limiter
# Imported lazily so the file is still readable as a script even if the
# vault_cli package isn't editable-installed in the current interpreter.
try:
from vault_cli.models import Question
except ImportError: # pragma: no cover
Question = None # type: ignore
# ─── corpus + registry helpers ────────────────────────────────────────────
def load_corpus_index() -> dict[str, dict]:
"""qid → full YAML dict for every published question.
We need full bodies (scenario + details) for the between-questions and
exemplars; the corpus.json summary doesn't carry them.
"""
out: dict[str, dict] = {}
for path in QUESTIONS_DIR.rglob("*.yaml"):
try:
with path.open(encoding="utf-8") as f:
d = yaml.safe_load(f)
except Exception:
continue
if isinstance(d, dict) and d.get("id"):
out[d["id"]] = d
return out
def next_ids_per_track(corpus: dict[str, dict], existing_drafts: list[Path]) -> dict[str, int]:
"""Return per-track next-available numeric suffix.
Considers BOTH committed YAMLs in the corpus AND any .yaml.draft files
written in earlier runs of this script — so a batch generating 30 drafts
gets 30 distinct IDs even before any of them is promoted into the
id-registry.
"""
max_for_track: dict[str, int] = {}
pat = re.compile(r"^([a-z]+)-(\d+)$")
for qid in corpus:
m = pat.match(qid)
if not m:
continue
track, num = m.group(1), int(m.group(2))
if num > max_for_track.get(track, -1):
max_for_track[track] = num
for draft in existing_drafts:
# filename like edge-2545.yaml.draft
stem = draft.name.split(".")[0]
m = pat.match(stem)
if m:
track, num = m.group(1), int(m.group(2))
if num > max_for_track.get(track, -1):
max_for_track[track] = num
return {t: n + 1 for t, n in max_for_track.items()}
# ─── prompt construction ──────────────────────────────────────────────────
SCHEMA_SUMMARY = """SCHEMA SUMMARY (Pydantic Question, v1.0):
REQUIRED FIELDS:
schema_version: "1.0"
id: "<track>-<NNNN>" # provided externally, do NOT invent
track: one of [cloud, edge, mobile, tinyml, global]
level: one of [L1, L2, L3, L4, L5, L6+]
zone: one of [analyze, design, diagnosis, evaluation, fluency,
implement, mastery, optimization, realization,
recall, specification]
topic: closed enum (87 topics; use the one in the gap input)
competency_area: one of [architecture, compute, cross-cutting, data,
deployment, latency, memory, networking,
optimization, parallelism, power, precision,
reliability]
bloom_level: one of [remember, understand, apply, analyze,
evaluate, create] # informs cognitive demand
title: ≤ 120 chars, descriptive, no trailing period
scenario: 1-3 sentences setting up a concrete situation
question: the explicit interrogative the candidate must answer
details.realistic_solution: 1-3 sentence high-quality answer
details.common_mistake: "**The Pitfall:** ...\\n**The Rationale:** ...\\n**The Consequence:** ..."
details.napkin_math: OPTIONAL but recommended for L3+
status: MUST be "draft" (this is a candidate for review)
provenance: MUST be "llm-draft"
requires_explanation: false (default)
expected_time_minutes: integer, ≥ 0 (typical: 5-15)
LEVEL ↔ BLOOM ROUGH MAPPING:
L1 → remember L2 → understand L3 → apply / analyze
L4 → analyze L5 → evaluate L6+ → create
STRICT JSON OUTPUT FORMAT (no prose, no fences, no extra fields):
{
"title": "<title>",
"scenario": "<scenario>",
"question": "<question>",
"zone": "<zone>",
"bloom_level": "<bloom>",
"phase": "training | inference | both",
"expected_time_minutes": <int>,
"tags": ["<tag>", ...],
"details": {
"realistic_solution": "<1-3 sentence answer>",
"common_mistake": "**The Pitfall:** ...\\n**The Rationale:** ...\\n**The Consequence:** ...",
"napkin_math": "**Assumptions & Constraints:** ...\\n\\n**Calculations:** ...\\n\\n**Conclusion:** ..."
}
}
"""
def question_payload(q: dict[str, Any]) -> dict[str, Any]:
"""Compact view of an existing question to feed Gemini as context."""
d = q.get("details") or {}
return {
"id": q.get("id"),
"level": q.get("level"),
"zone": q.get("zone"),
"bloom_level": q.get("bloom_level"),
"title": q.get("title"),
"scenario": q.get("scenario"),
"question": q.get("question"),
"realistic_solution": d.get("realistic_solution"),
}
def find_exemplars(
corpus: dict[str, dict],
track: str,
topic: str,
target_level: str,
skip_ids: set[str],
limit: int = 3,
) -> list[dict]:
"""Pick up to `limit` published questions in the same (track, topic) at
the target level. Used as style-and-cognitive-load exemplars for the
drafted question.
"""
pool = [
q for q in corpus.values()
if q.get("track") == track
and q.get("topic") == topic
and q.get("level") == target_level
and q.get("status") == "published"
and q.get("id") not in skip_ids
]
pool.sort(key=lambda q: q.get("id", ""))
return pool[:limit]
def build_prompt(gap: dict, between: list[dict], exemplars: list[dict]) -> str:
parts = [
"You are an ML systems interview question author. Draft ONE candidate",
"question that fills the missing rung in a pedagogical chain.",
"",
SCHEMA_SUMMARY,
"",
f"GAP TO FILL:",
f" track: {gap['track']}",
f" topic: {gap['topic']}",
f" target level: {gap['missing_level']}",
f" bridge between: {gap['between']}",
f" rationale: {gap.get('rationale', '')}",
"",
"BETWEEN-QUESTIONS (these MUST flank the new question pedagogically):",
json.dumps([question_payload(q) for q in between], indent=2),
"",
"EXEMPLARS at the target level in the same (track, topic) — match",
"their voice and cognitive load (NOT their content):",
json.dumps([question_payload(q) for q in exemplars], indent=2) if exemplars
else " (no in-bucket exemplars at this level — use the between-questions' style)",
"",
"AUTHORING RULES:",
" - The new question MUST chain naturally between the two between-questions:",
" Q[lower].level < new.level < Q[higher].level (or equal-level edges where",
" one between-question is exactly at target_level — re-read the gap).",
" - Same scenario/concept thread as the bridge — do NOT introduce a",
" new system topic.",
" - Cognitive load matches target Bloom: e.g. L3 (apply) asks the",
" candidate to perform a calculation; L4 (analyze) asks for",
" decomposition or root-cause; L5 (evaluate) asks for a",
" trade-off judgment with quantitative basis.",
" - realistic_solution is a high-quality, concise answer — NOT a",
" rubric. common_mistake follows the **Pitfall / Rationale /",
" Consequence** format. napkin_math has the **Assumptions /",
" Calculations / Conclusion** format.",
" - Avoid duplicating any title or scenario in the between or",
" exemplar inputs.",
" - Output ONLY the JSON object specified in the schema summary.",
]
return "\n".join(parts)
# ─── Gemini call ──────────────────────────────────────────────────────────
def call_gemini(prompt: str, model: str = GEMINI_MODEL, timeout: int = 600) -> dict | None:
try:
result = subprocess.run(
["gemini", "-m", model, "-p", prompt, "--yolo"],
capture_output=True, text=True, timeout=timeout,
)
except subprocess.TimeoutExpired:
return None
out = (result.stdout or "").strip()
if out.startswith("```"):
out = out.strip("`")
if out.startswith("json"):
out = out[4:].lstrip()
i = out.find("{")
j = out.rfind("}")
if i == -1 or j == -1:
if result.returncode != 0:
print(f" gemini exit {result.returncode}: {(result.stderr or '')[:200]}",
file=sys.stderr)
return None
try:
return json.loads(out[i:j+1])
except json.JSONDecodeError as e:
print(f" JSON parse failed: {e}", file=sys.stderr)
return None
# ─── draft assembly + validation ──────────────────────────────────────────
def assemble_draft(
gap: dict,
response: dict,
qid: str,
) -> dict[str, Any]:
"""Build the full YAML body from Gemini's response + gap-derived fields."""
now = datetime.now(timezone.utc).isoformat(timespec="seconds")
details_in = response.get("details") or {}
return {
"schema_version": "1.0",
"id": qid,
"track": gap["track"],
"level": gap["missing_level"],
"zone": response.get("zone") or "analyze",
"topic": gap["topic"],
# competency_area must come from the bridge — the gap entry doesn't
# carry it, so we inherit from the between-question. assemble_draft
# is called with this already resolved by main(); see _competency.
"competency_area": gap.get("_competency_area"),
"bloom_level": response.get("bloom_level"),
"phase": response.get("phase") or "both",
"title": response.get("title", "").strip(),
"scenario": response.get("scenario", "").strip(),
"question": response.get("question", "").strip(),
"details": {
"realistic_solution": (details_in.get("realistic_solution") or "").strip(),
"common_mistake": (details_in.get("common_mistake") or "").strip() or None,
"napkin_math": (details_in.get("napkin_math") or "").strip() or None,
},
"status": "draft",
"provenance": "llm-draft",
"requires_explanation": False,
"expected_time_minutes": int(response.get("expected_time_minutes") or 10),
"tags": response.get("tags") or None,
"_authoring": {
"origin": GEMINI_MODEL,
"tool": "generate_question_for_gap.py",
"generated_at": now,
"gap": {
"between": gap["between"],
"missing_level": gap["missing_level"],
"rationale": gap.get("rationale"),
},
},
}
def schema_validate(draft: dict[str, Any]) -> tuple[bool, str]:
"""Run the draft through Pydantic Question. Returns (ok, error_text)."""
if Question is None:
return False, "vault_cli not importable; install with `pip install -e interviews/vault-cli/`"
# Strip our private metadata; the Pydantic model will accept extra by
# config, but we don't want it to surface as a validation surprise.
body = {k: v for k, v in draft.items() if not k.startswith("_")}
# Drop None-valued optional details so Pydantic gets a clean dict.
if isinstance(body.get("details"), dict):
body["details"] = {k: v for k, v in body["details"].items() if v is not None}
try:
Question.model_validate(body)
return True, ""
except Exception as e: # pydantic ValidationError stringifies usefully
return False, str(e)
def write_draft(draft: dict[str, Any], output_dir: Path) -> Path:
track = draft["track"]
area = draft["competency_area"]
qid = draft["id"]
target_dir = output_dir / track / area
target_dir.mkdir(parents=True, exist_ok=True)
target = target_dir / f"{qid}.yaml.draft"
with target.open("w", encoding="utf-8") as f:
yaml.safe_dump(draft, f, sort_keys=False, allow_unicode=True, width=100)
return target
# ─── main ─────────────────────────────────────────────────────────────────
def resolve_competency_area(gap: dict, corpus: dict[str, dict]) -> str | None:
"""Inherit competency_area from the between-questions.
All published questions in the same (track, topic) bucket should agree on
competency_area (it's a topic-level invariant). We pick from the first
between question; if they disagree, prefer the lower-level one (since the
gap is bridging upward from it) and warn the caller.
"""
for qid in gap.get("between", []):
q = corpus.get(qid)
if q and q.get("competency_area"):
return q["competency_area"]
return None
def process_gap(
gap: dict,
corpus: dict[str, dict],
next_ids: dict[str, int],
output_dir: Path,
*,
dry_run: bool = False,
) -> dict[str, Any]:
"""Returns a one-row report describing the outcome."""
track = gap.get("track")
if not track:
    return {"qid": None, "ok": False, "why": "gap entry missing track", "gap": gap}
if track not in next_ids:
    next_ids[track] = 0
seq = next_ids[track]
qid = f"{track}-{seq:04d}"
next_ids[track] = seq + 1
between = [corpus[q] for q in gap.get("between", []) if q in corpus]
if len(between) < 1:
return {"qid": qid, "ok": False, "why": "no between-questions found in corpus",
"gap": gap}
competency = resolve_competency_area(gap, corpus)
if not competency:
return {"qid": qid, "ok": False, "why": "could not resolve competency_area",
"gap": gap}
exemplars = find_exemplars(
corpus,
track=track,
topic=gap["topic"],
target_level=gap["missing_level"],
skip_ids=set(gap.get("between", [])),
limit=3,
)
prompt = build_prompt(gap, between, exemplars)
if dry_run:
return {"qid": qid, "ok": True, "dry_run": True,
"prompt_chars": len(prompt),
"exemplars": [e["id"] for e in exemplars]}
response = call_gemini(prompt)
if response is None:
return {"qid": qid, "ok": False, "why": "no/unparsable Gemini response", "gap": gap}
gap_with_area = dict(gap)
gap_with_area["_competency_area"] = competency
draft = assemble_draft(gap_with_area, response, qid)
ok, why = schema_validate(draft)
if not ok:
return {"qid": qid, "ok": False, "why": f"schema: {why[:300]}",
"gap": gap, "draft": draft}
target = write_draft(draft, output_dir)
return {"qid": qid, "ok": True,
"path": str(target.relative_to(REPO_ROOT)),
"title": draft["title"],
"level": draft["level"],
"competency_area": draft["competency_area"]}
def select_gaps(args: argparse.Namespace) -> list[dict]:
if args.gap_index is not None:
all_gaps = json.loads(Path(args.gaps_from or DEFAULT_GAPS).read_text(encoding="utf-8"))
return [all_gaps[args.gap_index]]
gaps_path = Path(args.gaps_from or DEFAULT_GAPS)
all_gaps = json.loads(gaps_path.read_text(encoding="utf-8"))
return all_gaps[: args.limit] if args.limit else all_gaps
def main() -> int:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--gaps-from", type=Path,
help=f"path to gaps JSON (default {DEFAULT_GAPS})")
ap.add_argument("--gap-index", type=int,
help="process a single gap entry by 0-based index")
ap.add_argument("--limit", type=int, default=None,
help="process at most N gaps from the file")
ap.add_argument("--output-dir", type=Path, default=QUESTIONS_DIR,
help=f"target tree (default {QUESTIONS_DIR})")
ap.add_argument("--dry-run", action="store_true",
help="resolve gaps + build prompts, but don't call Gemini")
args = ap.parse_args()
corpus = load_corpus_index()
existing_drafts = list(args.output_dir.rglob("*.yaml.draft"))
next_ids = next_ids_per_track(corpus, existing_drafts)
print(f"corpus: {len(corpus)} questions; "
f"existing drafts: {len(existing_drafts)}")
print(f"next-id allocator: {dict(sorted(next_ids.items()))}")
gaps = select_gaps(args)
print(f"processing {len(gaps)} gap(s)")
results: list[dict[str, Any]] = []
for i, gap in enumerate(gaps):
print(f"\n[{i+1}/{len(gaps)}] {gap.get('track')}/{gap.get('topic')} "
f"L?→{gap.get('missing_level')} between={gap.get('between')}")
if i > 0 and not args.dry_run:
time.sleep(INTER_CALL_DELAY_S)
r = process_gap(gap, corpus, next_ids, args.output_dir, dry_run=args.dry_run)
results.append(r)
if r.get("ok"):
print(f"{r['qid']}: {r.get('path') or '(dry-run)'}")
else:
print(f"{r['qid']}: {r.get('why')}")
n_ok = sum(1 for r in results if r.get("ok"))
print(f"\nDONE: {n_ok}/{len(results)} draft(s) written successfully")
return 0 if n_ok > 0 or args.dry_run else 1
if __name__ == "__main__":
raise SystemExit(main())

interviews/vault-cli/scripts/validate_drafts.py

@@ -0,0 +1,492 @@
#!/usr/bin/env python3
"""Validate Gemini-authored draft questions (Phase 3.b).
For each ``*.yaml.draft`` under interviews/vault/questions/, run a
multi-gate scorecard:
1. schema — Pydantic Question model (same gate as published)
2. originality — cosine vs nearest neighbour in the same (track, topic);
reject if any neighbour exceeds the threshold (default 0.92)
3. level_fit — Gemini-judge: "does this question's cognitive load match
level=<L>?", calibrated against ≤5 existing L-level
questions in the same topic.
4. coherence — Gemini-judge: "are scenario / question /
realistic_solution mutually consistent?"
5. bridge — Gemini-judge: "does this question pedagogically chain
between <between[0]> and <between[1]> from the gap?"
A draft passes when **all** gates return "yes" (or skipped). Output:
- per-draft scorecard rows in interviews/vault/draft-validation-scorecard.json
- stdout summary: pass/fail counts + per-gate failure reasons
Use case: pilot run lands ~30 drafts in the tree; this script tells the
human reviewer which to look at first (passes) vs which to discard
(failed bridge / failed coherence).
The originality gate needs an embedding model. By default it loads
BAAI/bge-small-en-v1.5 (the same model used for the corpus's
embeddings.npz) so cosine values are directly comparable. Pass
``--no-originality`` to skip if the model load is undesirable.
The LLM-judge gates need ``gemini`` on PATH (gemini-3.1-pro-preview).
Pass ``--no-llm-judge`` to skip those gates and only run schema +
originality.
"""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import yaml
REPO_ROOT = Path(__file__).resolve().parents[3]
VAULT_DIR = REPO_ROOT / "interviews" / "vault"
QUESTIONS_DIR = VAULT_DIR / "questions"
EMBEDDINGS_PATH = VAULT_DIR / "embeddings.npz"
DEFAULT_OUTPUT = VAULT_DIR / "draft-validation-scorecard.json"
GEMINI_MODEL = "gemini-3.1-pro-preview"
ORIGINALITY_THRESHOLD = 0.92 # cosine; >= this is "too duplicative"
LEVEL_FIT_EXEMPLAR_LIMIT = 5
try:
from vault_cli.models import Question
except ImportError:
Question = None # type: ignore
# ─── corpus / drafts ──────────────────────────────────────────────────────
def load_yaml(path: Path) -> dict | None:
try:
with path.open(encoding="utf-8") as f:
d = yaml.safe_load(f)
except Exception:
return None
return d if isinstance(d, dict) else None
def load_corpus_index() -> dict[str, dict]:
out: dict[str, dict] = {}
for path in QUESTIONS_DIR.rglob("*.yaml"):
d = load_yaml(path)
if d and d.get("id"):
out[d["id"]] = d
return out
def find_drafts(scope: Path | None = None) -> list[Path]:
root = scope or QUESTIONS_DIR
return sorted(root.rglob("*.yaml.draft"))
def question_payload(q: dict[str, Any]) -> dict[str, Any]:
d = q.get("details") or {}
return {
"id": q.get("id"),
"level": q.get("level"),
"title": q.get("title"),
"scenario": q.get("scenario"),
"question": q.get("question"),
"realistic_solution": d.get("realistic_solution"),
}
# ─── Gate 1: schema ───────────────────────────────────────────────────────
def gate_schema(draft: dict[str, Any]) -> tuple[bool, str]:
if Question is None:
return False, "vault_cli not importable; pip install -e interviews/vault-cli/"
body = {k: v for k, v in draft.items() if not k.startswith("_")}
if isinstance(body.get("details"), dict):
body["details"] = {k: v for k, v in body["details"].items() if v is not None}
try:
Question.model_validate(body)
return True, ""
except Exception as e:
return False, str(e)[:300]
# ─── Gate 2: originality (cosine vs neighbours) ───────────────────────────
_embed_state: dict[str, Any] = {}
def _load_embedding_model_and_corpus():
"""Lazy: load BAAI/bge-small-en-v1.5 + corpus vectors once per run."""
if "model" in _embed_state:
return _embed_state
import numpy as np
from sentence_transformers import SentenceTransformer
if not EMBEDDINGS_PATH.exists():
raise FileNotFoundError(f"missing {EMBEDDINGS_PATH} — needed for originality gate")
npz = np.load(EMBEDDINGS_PATH, allow_pickle=True)
model_name = str(npz["model_name"])
model = SentenceTransformer(model_name)
_embed_state.update({
"model": model,
"model_name": model_name,
"vectors": npz["vectors"], # (N, dim) L2-normalised
"qids": [str(x) for x in npz["qids"]],
"qid_to_row": {str(q): i for i, q in enumerate(npz["qids"])},
})
return _embed_state
def gate_originality(
draft: dict[str, Any],
corpus: dict[str, dict],
threshold: float = ORIGINALITY_THRESHOLD,
) -> tuple[bool, str, dict[str, Any]]:
"""Return (ok, reason, detail).
detail carries the top-1 neighbour qid + cosine, useful for the human
reviewer to spot-check against.
"""
import numpy as np
state = _load_embedding_model_and_corpus()
model = state["model"]
vectors = state["vectors"]
qids = state["qids"]
qid_to_row = state["qid_to_row"]
# Embed the draft (concat title + scenario + question — what the v1
# corpus embedding script also used for its rows).
text = "\n".join([
draft.get("title", "") or "",
draft.get("scenario", "") or "",
draft.get("question", "") or "",
])
vec = model.encode([text], normalize_embeddings=True)[0]
# Restrict comparisons to the same (track, topic) bucket — that's
# where duplicates would actually matter.
track = draft.get("track")
topic = draft.get("topic")
bucket_qids = [
qid for qid, q in corpus.items()
if q.get("track") == track and q.get("topic") == topic
and qid in qid_to_row
]
if not bucket_qids:
return True, "", {"note": "no in-bucket corpus neighbours; skipping"}
rows = np.array([qid_to_row[q] for q in bucket_qids], dtype=np.int64)
# cosine = dot product since both sides are L2-normalised
sims = vectors[rows] @ vec # (len(rows),)
top = int(np.argmax(sims))
top_qid = bucket_qids[top]
top_cos = float(sims[top])
detail = {"top_neighbour": top_qid, "cosine": round(top_cos, 4),
"threshold": threshold, "bucket_size": len(bucket_qids)}
if top_cos >= threshold:
return False, f"too similar to {top_qid} (cosine={top_cos:.3f} >= {threshold})", detail
return True, "", detail
# ─── Gate 3-5: Gemini judges ──────────────────────────────────────────────
def call_gemini_judge(prompt: str, timeout: int = 240) -> dict | None:
"""Single judge call; expects strict-JSON {"verdict": "yes|no", "rationale": "..."}."""
try:
result = subprocess.run(
["gemini", "-m", GEMINI_MODEL, "-p", prompt, "--yolo"],
capture_output=True, text=True, timeout=timeout,
)
except subprocess.TimeoutExpired:
return None
out = (result.stdout or "").strip()
if out.startswith("```"):
out = out.strip("`")
if out.startswith("json"):
out = out[4:].lstrip()
i = out.find("{")
j = out.rfind("}")
if i == -1 or j == -1:
return None
try:
return json.loads(out[i:j+1])
except json.JSONDecodeError:
return None
def _judge_block(draft: dict[str, Any]) -> str:
return json.dumps(question_payload(draft), indent=2)
def gate_level_fit(draft: dict, corpus: dict[str, dict]) -> tuple[bool, str, dict]:
target_level = draft.get("level")
track = draft.get("track")
topic = draft.get("topic")
exemplars = sorted(
[q for q in corpus.values()
if q.get("track") == track and q.get("topic") == topic
and q.get("level") == target_level
and q.get("status") == "published"],
key=lambda q: q.get("id", ""),
)[:LEVEL_FIT_EXEMPLAR_LIMIT]
if not exemplars:
return True, "", {"note": f"no published L={target_level} exemplars in bucket; skipping"}
prompt = f"""You are calibrating cognitive load. Given an EXAMPLE PAIR of
existing published interview questions at level={target_level} for
track={track}, topic={topic}, judge whether the CANDIDATE question
matches that level's typical cognitive demand.
Bloom mapping: L1=remember, L2=understand, L3=apply, L4=analyze,
L5=evaluate, L6+=create.
EXEMPLARS at level={target_level}:
{json.dumps([question_payload(q) for q in exemplars], indent=2)}
CANDIDATE:
{_judge_block(draft)}
Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
resp = call_gemini_judge(prompt)
if resp is None:
return False, "no judge response", {}
verdict = (resp.get("verdict") or "").strip().lower()
if verdict == "yes":
return True, "", {"rationale": resp.get("rationale", "")}
return False, f"level_fit=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}
def gate_coherence(draft: dict) -> tuple[bool, str, dict]:
prompt = f"""Judge whether the scenario, question, and realistic_solution
are MUTUALLY CONSISTENT. Specifically:
- Does the question logically follow from the scenario?
- Does the realistic_solution actually answer the question (not adjacent)?
- Are the numbers / system parameters internally consistent across all
three fields (no contradictions)?
CANDIDATE:
{_judge_block(draft)}
Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
resp = call_gemini_judge(prompt)
if resp is None:
return False, "no judge response", {}
verdict = (resp.get("verdict") or "").strip().lower()
if verdict == "yes":
return True, "", {"rationale": resp.get("rationale", "")}
return False, f"coherence=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}
def gate_bridge(draft: dict, corpus: dict[str, dict]) -> tuple[bool, str, dict]:
auth = draft.get("_authoring") or {}
gap = auth.get("gap") or {}
between_ids = gap.get("between") or []
between = [corpus.get(q) for q in between_ids if corpus.get(q)]
if len(between) < 2:
# Without two between-questions we can't judge a bridge meaningfully.
return True, "", {"note": "fewer than 2 between-questions in corpus; skipping"}
prompt = f"""Judge whether the CANDIDATE question pedagogically chains
between the two BETWEEN-questions. Specifically:
- Is the candidate's cognitive load above between[0]'s level and at or
below between[1]'s level (Bloom progression direction)?
- Does the candidate share scenario/concept thread with the between-
questions (not introducing a new system)?
- Would inserting the candidate between the two existing questions
produce a coherent +1 (or +2 last-resort) progression chain?
BETWEEN[0] (lower):
{json.dumps(question_payload(between[0]), indent=2)}
BETWEEN[1] (higher):
{json.dumps(question_payload(between[1]), indent=2)}
CANDIDATE:
{_judge_block(draft)}
Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
resp = call_gemini_judge(prompt)
if resp is None:
return False, "no judge response", {}
verdict = (resp.get("verdict") or "").strip().lower()
if verdict == "yes":
return True, "", {"rationale": resp.get("rationale", "")}
return False, f"bridge=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}
# ─── runner ───────────────────────────────────────────────────────────────
def evaluate_draft(
draft_path: Path,
corpus: dict[str, dict],
args: argparse.Namespace,
) -> dict[str, Any]:
draft = load_yaml(draft_path)
if not draft:
return {"path": str(draft_path), "verdict": "fail",
"errors": ["could not load YAML"]}
try:
rel_path = str(draft_path.relative_to(REPO_ROOT))
except ValueError:
rel_path = str(draft_path)
rec: dict[str, Any] = {
"path": rel_path,
"draft_id": draft.get("id"),
"track": draft.get("track"),
"topic": draft.get("topic"),
"level": draft.get("level"),
}
# Gate 1 — schema (mandatory)
ok, why = gate_schema(draft)
rec["schema_ok"] = ok
if not ok:
rec["schema_error"] = why
rec["verdict"] = "fail"
return rec # downstream gates assume a structurally valid YAML
# Gate 2 — originality
if args.no_originality:
rec["originality"] = "skipped"
else:
try:
ok, why, detail = gate_originality(draft, corpus, threshold=args.threshold)
rec["originality"] = "pass" if ok else "fail"
rec["originality_detail"] = detail
if not ok:
rec["originality_reason"] = why
except Exception as e:
rec["originality"] = "error"
rec["originality_reason"] = str(e)[:200]
# Gates 3-5 — Gemini judges
if args.no_llm_judge:
rec["level_fit"] = "skipped"
rec["coherence"] = "skipped"
rec["bridge"] = "skipped"
else:
for name, gate in [("level_fit", gate_level_fit),
("coherence", gate_coherence),
("bridge", gate_bridge)]:
try:
if name == "coherence":
ok, why, detail = gate(draft)
else:
ok, why, detail = gate(draft, corpus)
except Exception as e:
rec[name] = "error"
rec[f"{name}_reason"] = str(e)[:200]
continue
rec[name] = "pass" if ok else "fail"
rec[f"{name}_detail"] = detail
if not ok:
rec[f"{name}_reason"] = why
time.sleep(args.judge_delay) # be polite between calls
# Final verdict: pass iff every non-skipped gate is pass.
gate_results = [
rec.get("originality"),
rec.get("level_fit"),
rec.get("coherence"),
rec.get("bridge"),
]
has_fail = any(r == "fail" for r in gate_results)
has_error = any(r == "error" for r in gate_results)
rec["verdict"] = "fail" if has_fail else ("error" if has_error else "pass")
return rec
def main() -> int:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--scope", type=Path, default=None,
help=f"directory tree to scan for *.yaml.draft "
f"(default {QUESTIONS_DIR})")
ap.add_argument("--output", type=Path, default=DEFAULT_OUTPUT,
help=f"scorecard JSON (default {DEFAULT_OUTPUT})")
ap.add_argument("--no-originality", action="store_true",
help="skip the embedding-based originality gate")
ap.add_argument("--no-llm-judge", action="store_true",
help="skip the Gemini-judge gates (level_fit, coherence, bridge)")
ap.add_argument("--threshold", type=float, default=ORIGINALITY_THRESHOLD,
help=f"originality cosine cutoff (default {ORIGINALITY_THRESHOLD})")
ap.add_argument("--judge-delay", type=float, default=4.0,
help="seconds between Gemini judge calls (default 4.0)")
ap.add_argument("--limit", type=int, default=None,
help="evaluate only the first N drafts")
args = ap.parse_args()
drafts = find_drafts(args.scope)
if args.limit:
drafts = drafts[: args.limit]
if not drafts:
print(f"no *.yaml.draft files found under {args.scope or QUESTIONS_DIR}")
return 0
corpus = load_corpus_index()
print(f"corpus: {len(corpus)} published+draft questions; "
f"drafts to evaluate: {len(drafts)}")
rows: list[dict[str, Any]] = []
for i, p in enumerate(drafts, start=1):
try:
display = p.relative_to(REPO_ROOT)
except ValueError:
display = p
print(f"\n[{i}/{len(drafts)}] {display}")
rec = evaluate_draft(p, corpus, args)
gate_summary = ", ".join(
f"{g}={rec.get(g, '-')}"
for g in ("originality", "level_fit", "coherence", "bridge")
)
print(f" verdict={rec.get('verdict'):4s} {gate_summary}")
if rec.get("verdict") == "fail":
for k in ("schema_error", "originality_reason",
"level_fit_reason", "coherence_reason", "bridge_reason"):
if k in rec:
print(f" {k}: {str(rec[k])[:200]}")
rows.append(rec)
try:
out_display = args.output.relative_to(REPO_ROOT)
except ValueError:
out_display = args.output
args.output.parent.mkdir(parents=True, exist_ok=True)
args.output.write_text(json.dumps({
"generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
"originality_threshold": args.threshold,
"drafts_evaluated": len(rows),
"passes": sum(1 for r in rows if r.get("verdict") == "pass"),
"fails": sum(1 for r in rows if r.get("verdict") == "fail"),
"errors": sum(1 for r in rows if r.get("verdict") == "error"),
"rows": rows,
}, indent=2) + "\n")
print(f"\nwrote {out_display}")
n_pass = sum(1 for r in rows if r.get("verdict") == "pass")
n_fail = sum(1 for r in rows if r.get("verdict") == "fail")
n_err = sum(1 for r in rows if r.get("verdict") == "error")
print(f"summary: pass={n_pass} fail={n_fail} error={n_err}")
return 0
if __name__ == "__main__":
raise SystemExit(main())