feat(vault-cli): Phase 3.a + 3.b — gap-driven authoring tooling

Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.

generate_question_for_gap.py (3.a):
  - Reads a gap entry, loads between-questions + same-bucket exemplars,
    prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
    and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
    drafts out of vault check / vault build until promotion.
  - ID allocator scans corpus + existing drafts so a batch run gets
    distinct fresh IDs without touching id-registry.yaml.
  - Modes: --gap-index, --gaps-from + --limit, --dry-run.

validate_drafts.py (3.b):
  - Five gates per draft: schema (Pydantic), originality (cosine vs
    in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus
    embeddings.npz so values are comparable; cutoff 0.92), level_fit
    (Gemini-judge against same-level exemplars), coherence
    (Gemini-judge: scenario/question/solution consistency), and bridge
    (Gemini-judge: chain-fit between the gap's two anchors).
  - Final verdict: pass iff every non-skipped gate passes.
  - Skips: --no-originality, --no-llm-judge.
  - Output: interviews/vault/draft-validation-scorecard.json.

Smoke checks:
  - 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
    cloud-4579. Synthetic Gemini response Pydantic-validates clean.
  - 3.b on a synthetic /tmp draft: schema + originality pass (top
    neighbour cosine 0.73 vs 0.92 threshold).

Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.

CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vijay Janapa Reddi
2026-05-01 11:31:06 -04:00
parent d94a79942c
commit 84b1fab082
3 changed files with 1085 additions and 2 deletions

interviews/vault-cli/docs/CHAIN_ROADMAP.md

@@ -3,7 +3,7 @@
**Status:** active workstream
**Branch:** `yaml-audit` (off `dev`)
**Worktree:** `/Users/VJ/GitHub/MLSysBook-yaml-audit`
-**Last updated:** 2026-05-01 (Phase 1 + 2 + 4.2/4.8 shipped — 879 chains, tier in UI, docs current)
+**Last updated:** 2026-05-01 (Phase 3.a + 3.b tooling shipped; 3.c pilot deferred for review)
This document is the canonical resumable plan for the vault chain rebuild
+ corpus growth work. **Future Claude sessions: read the "Resume Here"
@@ -368,7 +368,7 @@ primary chains in default surfaces, exposes secondary in "more paths."
## Phase 3 — Gap-driven question authoring
-**Status:** `not started`
+**Status:** `tooling complete (3.a + 3.b); pilot 3.c deferred for review`
**Goal:** Use the 138+ entries in `gaps.proposed.json` to author new
questions filling missing rungs, validated independently before commit.
This is the durable corpus growth strategy.
@@ -966,5 +966,103 @@ available to review the first few generated drafts.
---
### 2026-05-01 — Phase 3.a + 3.b: authoring + validation tooling
**What was done:**
**Phase 3.a — `generate_question_for_gap.py`:**
- Reads a gap entry (`{track, topic, missing_level, between, rationale}`)
from gaps.proposed.json (or .lenient.json), loads the between-questions
in full + up to 3 same-bucket exemplars at the target level, prompts
Gemini-3.1-pro-preview with the schema summary + bridge context, and
writes a candidate question to
`interviews/vault/questions/<track>/<area>/<id>.yaml.draft`.
- ID allocator scans the existing corpus + already-written drafts so a
batch run gets distinct fresh IDs without touching `id-registry.yaml`
(registry append happens at promotion time, not generation).
- Authoring metadata stamped under a private `_authoring` block:
origin model, tool name, timestamp, and the source gap entry. The
Pydantic Question model has `extra="allow"`, so this passes schema.
- Modes: `--gap-index <N>` (single gap), `--gaps-from <path> --limit N`
(batch), `--dry-run` (build prompts without calling Gemini).
- Smoke checks:
- `--dry-run --gap-index 0` resolves the first gap, finds 3 exemplars,
builds the prompt, allocates `cloud-4579`. ✓
- Synthetic Gemini response → `assemble_draft` → `Question.model_validate`
passes; YAML preview looks right (12-field body, sensible details). ✓
**Phase 3.b — `validate_drafts.py`:**
- Five-gate scorecard per draft:
1. **schema** — Pydantic Question (mandatory; downstream gates skip
on schema fail to avoid spurious LLM calls)
2. **originality** — embeds `title + scenario + question` with
`BAAI/bge-small-en-v1.5` (matches the corpus embeddings.npz model
so cosines are directly comparable), compares against in-bucket
neighbors, flags any `cosine ≥ 0.92`
3. **level_fit** — Gemini-judge against ≤5 published exemplars at the
target level in the same (track, topic)
4. **coherence** — Gemini-judge: scenario / question /
realistic_solution mutually consistent
5. **bridge** — Gemini-judge: candidate genuinely chains between the
two `between` questions named in `_authoring.gap`
- Skips: `--no-originality` (skip embed model load),
`--no-llm-judge` (skip Gemini gates). Schema gate is unconditional.
- Output: `interviews/vault/draft-validation-scorecard.json` with per-row
detail + final verdict (`pass | fail | error`).
- Smoke check: synthetic draft in /tmp passed schema + originality
(top-neighbor cosine 0.73 vs 0.92 threshold). End-to-end runner
produced a well-formed scorecard. ✓
**What was deliberately not done tonight:**
- **Phase 3.c (pilot run on 30 highest-value gaps):** This generates
new YAML question content that needs human review *before* promotion.
Running 30 unsupervised generations and 30×3 LLM-judge calls without
the user available to spot-check the first few outputs is the wrong
shape of work for an overnight slot. The tooling is ready when the
user is.
- **Phase 3.d–3.f:** Promotion + re-chain are downstream of 3.c
acceptance.
**Recommended pilot when the user is back:**
1. Pick 30 gaps from `gaps.proposed.lenient.json` where the bucket has
≥4 questions already (just missing the bridge; see the selection sketch
after this list):
```bash
python3 interviews/vault-cli/scripts/generate_question_for_gap.py \
--gaps-from interviews/vault/gaps.proposed.lenient.json \
--limit 30
```
2. Validate:
```bash
python3 interviews/vault-cli/scripts/validate_drafts.py
```
3. Manually review the passing drafts (~20-25 expected).
4. Promote: rename `.yaml.draft` → `.yaml`, append to id-registry.
5. Re-run `build_chains_with_gemini.py --all` so the new questions get
absorbed into chains.
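Note on step 1: the command above just takes the first 30 entries; the "bucket has ≥4 questions" criterion needs a small pre-filter. A minimal sketch, assuming `gaps.proposed.lenient.json` is a JSON array of gap entries with `track`/`topic` fields (as in the generator's docstring); the `/tmp/pilot-gaps.json` output path is illustrative, not a committed artifact:
```python
#!/usr/bin/env python3
"""Sketch: keep the first 30 gaps whose (track, topic) bucket already has
at least 4 committed questions, and write them to a file usable with
generate_question_for_gap.py --gaps-from."""
import json
from collections import Counter
from pathlib import Path

import yaml

VAULT = Path("interviews/vault")

# Count committed questions per (track, topic) bucket.
bucket_sizes = Counter()
for p in (VAULT / "questions").rglob("*.yaml"):
    q = yaml.safe_load(p.read_text(encoding="utf-8"))
    if isinstance(q, dict) and q.get("track") and q.get("topic"):
        bucket_sizes[(q["track"], q["topic"])] += 1

gaps = json.loads((VAULT / "gaps.proposed.lenient.json").read_text(encoding="utf-8"))
pilot = [g for g in gaps
         if bucket_sizes[(g.get("track"), g.get("topic"))] >= 4][:30]

out = Path("/tmp/pilot-gaps.json")  # illustrative location
out.write_text(json.dumps(pilot, indent=2) + "\n", encoding="utf-8")
print(f"{len(pilot)} gaps -> {out}")
```
Then pass `--gaps-from /tmp/pilot-gaps.json` to the generate command in step 1 in place of the full lenient file.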
**Files committed:**
- `interviews/vault-cli/scripts/generate_question_for_gap.py` (new)
- `interviews/vault-cli/scripts/validate_drafts.py` (new)
- `interviews/vault-cli/docs/CHAIN_ROADMAP.md` (this Progress Log entry +
status flips)
**Notes for next session:**
- Both scripts assume `gemini` CLI on PATH (gemini-3.1-pro-preview) and,
for originality, the corpus's `embeddings.npz` (gitignored, regenerable
by the existing embedding scripts). `validate_drafts --no-llm-judge`
is a fast first cut that only exercises schema + originality if you
want to triage drafts before paying for the LLM-judge calls.
- Heads up: each draft in 3.b consumes ~3 Gemini calls (level_fit +
coherence + bridge). 30 drafts → ~90 calls. Daily cap is 250.
- `id-registry.yaml` is append-only and CI-enforced. Promotion (3.d)
needs to add new IDs to it; that's not yet wired into a script —
manual append for the pilot, then we can extract a `vault promote`
helper from the pattern.
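Until that `vault promote` helper exists, the manual step can be as small as the sketch below: it renames the reviewed drafts and prints the IDs that still need the append-only `id-registry.yaml` entry. The registry's entry format isn't assumed here (which is why the append itself stays manual), and the script is a sketch, not committed tooling:
```python
#!/usr/bin/env python3
"""Sketch: promote reviewed drafts passed on the command line by renaming
<id>.yaml.draft -> <id>.yaml, then list the IDs that still need a manual
append to interviews/vault/id-registry.yaml."""
import sys
from pathlib import Path

import yaml

promoted: list[str] = []
for arg in sys.argv[1:]:  # only the drafts a human reviewer has accepted
    draft = Path(arg)
    if not draft.name.endswith(".yaml.draft"):
        print(f"skip (not a .yaml.draft): {draft}", file=sys.stderr)
        continue
    body = yaml.safe_load(draft.read_text(encoding="utf-8"))
    qid = body.get("id") if isinstance(body, dict) else None
    if not qid:
        print(f"skip (no id field): {draft}", file=sys.stderr)
        continue
    draft.rename(draft.with_suffix(""))  # edge-2545.yaml.draft -> edge-2545.yaml
    promoted.append(qid)

print("IDs to append (manually, append-only) to interviews/vault/id-registry.yaml:")
for qid in promoted:
    print(f"  {qid}")
```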
**Next step:** Phase 3.c — pilot run on 30 high-value gaps (best done
with the user available to spot-check the first few outputs).
---
<!-- Append new entries above this comment; reverse chronological order is fine,
but keep entries dated and self-contained for resume context. -->

interviews/vault-cli/scripts/generate_question_for_gap.py

@@ -0,0 +1,493 @@
#!/usr/bin/env python3
"""Author a candidate question to fill a chain gap (Phase 3.a).
Reads a gap entry (from gaps.proposed.json / gaps.proposed.lenient.json)
that names two existing questions and a missing Bloom level between
them, then prompts Gemini-3.1-pro-preview to draft a bridging question
that fits the (track, topic, target-level) slot.
Inputs per gap entry:
{
"track": "edge",
"topic": "memory-mapped-inference",
"missing_level": "L3",
"between": ["edge-0220", "edge-0224"],
"rationale": "..."
}
Outputs per accepted draft:
interviews/vault/questions/<track>/<area>/<auto-id>.yaml.draft
— full question YAML with stamped authoring metadata. The .draft
suffix is intentional: vault check / vault build only load *.yaml,
so drafts ride along in the tree without affecting the release set
until they are promoted (renamed to .yaml) by a follow-up step.
Usage:
python3 generate_question_for_gap.py --gap-index 0
python3 generate_question_for_gap.py --gaps-from interviews/vault/gaps.proposed.json --limit 5
python3 generate_question_for_gap.py --gaps-from <path> --limit 30 --output-dir <dir>
This is the Phase 3.a tool. Validation (originality / level-fit /
coherence / bridge) is a separate concern handled by validate_drafts.py.
The only validation done here is structural Pydantic-schema acceptance,
which is the gate that prevents writing a malformed YAML to disk.
"""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import yaml
REPO_ROOT = Path(__file__).resolve().parents[3]
VAULT_DIR = REPO_ROOT / "interviews" / "vault"
QUESTIONS_DIR = VAULT_DIR / "questions"
ID_REGISTRY = VAULT_DIR / "id-registry.yaml"
DEFAULT_GAPS = VAULT_DIR / "gaps.proposed.json"
GEMINI_MODEL = "gemini-3.1-pro-preview"
INTER_CALL_DELAY_S = 6 # be polite to the Gemini CLI's rate limiter
# Imported lazily so the file is still readable as a script even if the
# vault_cli package isn't editable-installed in the current interpreter.
try:
from vault_cli.models import Question
except ImportError: # pragma: no cover
Question = None # type: ignore
# ─── corpus + registry helpers ────────────────────────────────────────────
def load_corpus_index() -> dict[str, dict]:
"""qid → full YAML dict for every published question.
We need full bodies (scenario + details) for the between-questions and
exemplars; the corpus.json summary doesn't carry them.
"""
out: dict[str, dict] = {}
for path in QUESTIONS_DIR.rglob("*.yaml"):
try:
with path.open(encoding="utf-8") as f:
d = yaml.safe_load(f)
except Exception:
continue
if isinstance(d, dict) and d.get("id"):
out[d["id"]] = d
return out
def next_ids_per_track(corpus: dict[str, dict], existing_drafts: list[Path]) -> dict[str, int]:
"""Return per-track next-available numeric suffix.
Considers BOTH committed YAMLs in the corpus AND any .yaml.draft files
written in earlier runs of this script — so a batch generating 30 drafts
gets 30 distinct IDs even before any of them is promoted into the
id-registry.
"""
max_for_track: dict[str, int] = {}
pat = re.compile(r"^([a-z]+)-(\d+)$")
for qid in corpus:
m = pat.match(qid)
if not m:
continue
track, num = m.group(1), int(m.group(2))
if num > max_for_track.get(track, -1):
max_for_track[track] = num
for draft in existing_drafts:
# filename like edge-2545.yaml.draft
stem = draft.name.split(".")[0]
m = pat.match(stem)
if m:
track, num = m.group(1), int(m.group(2))
if num > max_for_track.get(track, -1):
max_for_track[track] = num
return {t: n + 1 for t, n in max_for_track.items()}
# ─── prompt construction ──────────────────────────────────────────────────
SCHEMA_SUMMARY = """SCHEMA SUMMARY (Pydantic Question, v1.0):
REQUIRED FIELDS:
schema_version: "1.0"
id: "<track>-<NNNN>" # provided externally, do NOT invent
track: one of [cloud, edge, mobile, tinyml, global]
level: one of [L1, L2, L3, L4, L5, L6+]
zone: one of [analyze, design, diagnosis, evaluation, fluency,
implement, mastery, optimization, realization,
recall, specification]
topic: closed enum (87 topics; use the one in the gap input)
competency_area: one of [architecture, compute, cross-cutting, data,
deployment, latency, memory, networking,
optimization, parallelism, power, precision,
reliability]
bloom_level: one of [remember, understand, apply, analyze,
evaluate, create] # informs cognitive demand
title: ≤ 120 chars, descriptive, no trailing period
scenario: 1-3 sentences setting up a concrete situation
question: the explicit interrogative the candidate must answer
details.realistic_solution: 1-3 sentence high-quality answer
details.common_mistake: "**The Pitfall:** ...\\n**The Rationale:** ...\\n**The Consequence:** ..."
details.napkin_math: OPTIONAL but recommended for L3+
status: MUST be "draft" (this is a candidate for review)
provenance: MUST be "llm-draft"
requires_explanation: false (default)
expected_time_minutes: integer, ≥ 0 (typical: 5-15)
LEVEL ↔ BLOOM ROUGH MAPPING:
L1 → remember L2 → understand L3 → apply / analyze
L4 → analyze L5 → evaluate L6+ → create
STRICT JSON OUTPUT FORMAT (no prose, no fences, no extra fields):
{
"title": "<title>",
"scenario": "<scenario>",
"question": "<question>",
"zone": "<zone>",
"bloom_level": "<bloom>",
"phase": "training | inference | both",
"expected_time_minutes": <int>,
"tags": ["<tag>", ...],
"details": {
"realistic_solution": "<1-3 sentence answer>",
"common_mistake": "**The Pitfall:** ...\\n**The Rationale:** ...\\n**The Consequence:** ...",
"napkin_math": "**Assumptions & Constraints:** ...\\n\\n**Calculations:** ...\\n\\n**Conclusion:** ..."
}
}
"""
def question_payload(q: dict[str, Any]) -> dict[str, Any]:
"""Compact view of an existing question to feed Gemini as context."""
d = q.get("details") or {}
return {
"id": q.get("id"),
"level": q.get("level"),
"zone": q.get("zone"),
"bloom_level": q.get("bloom_level"),
"title": q.get("title"),
"scenario": q.get("scenario"),
"question": q.get("question"),
"realistic_solution": d.get("realistic_solution"),
}
def find_exemplars(
corpus: dict[str, dict],
track: str,
topic: str,
target_level: str,
skip_ids: set[str],
limit: int = 3,
) -> list[dict]:
"""Pick up to `limit` published questions in the same (track, topic) at
the target level. Used as style-and-cognitive-load exemplars for the
drafted question.
"""
pool = [
q for q in corpus.values()
if q.get("track") == track
and q.get("topic") == topic
and q.get("level") == target_level
and q.get("status") == "published"
and q.get("id") not in skip_ids
]
pool.sort(key=lambda q: q.get("id", ""))
return pool[:limit]
def build_prompt(gap: dict, between: list[dict], exemplars: list[dict]) -> str:
parts = [
"You are an ML systems interview question author. Draft ONE candidate",
"question that fills the missing rung in a pedagogical chain.",
"",
SCHEMA_SUMMARY,
"",
f"GAP TO FILL:",
f" track: {gap['track']}",
f" topic: {gap['topic']}",
f" target level: {gap['missing_level']}",
f" bridge between: {gap['between']}",
f" rationale: {gap.get('rationale', '')}",
"",
"BETWEEN-QUESTIONS (these MUST flank the new question pedagogically):",
json.dumps([question_payload(q) for q in between], indent=2),
"",
"EXEMPLARS at the target level in the same (track, topic) — match",
"their voice and cognitive load (NOT their content):",
json.dumps([question_payload(q) for q in exemplars], indent=2) if exemplars
else " (no in-bucket exemplars at this level — use the between-questions' style)",
"",
"AUTHORING RULES:",
" - The new question MUST chain naturally between the two between-questions:",
" Q[lower].level < new.level < Q[higher].level (or equal-level edges where",
" one between-question is exactly at target_level — re-read the gap).",
" - Same scenario/concept thread as the bridge — do NOT introduce a",
" new system topic.",
" - Cognitive load matches target Bloom: e.g. L3 (apply) asks the",
" candidate to perform a calculation; L4 (analyze) asks for",
" decomposition or root-cause; L5 (evaluate) asks for a",
" trade-off judgment with quantitative basis.",
" - realistic_solution is a high-quality, concise answer — NOT a",
" rubric. common_mistake follows the **Pitfall / Rationale /",
" Consequence** format. napkin_math has the **Assumptions /",
" Calculations / Conclusion** format.",
" - Avoid duplicating any title or scenario in the between or",
" exemplar inputs.",
" - Output ONLY the JSON object specified in the schema summary.",
]
return "\n".join(parts)
# ─── Gemini call ──────────────────────────────────────────────────────────
def call_gemini(prompt: str, model: str = GEMINI_MODEL, timeout: int = 600) -> dict | None:
try:
result = subprocess.run(
["gemini", "-m", model, "-p", prompt, "--yolo"],
capture_output=True, text=True, timeout=timeout,
)
except subprocess.TimeoutExpired:
return None
out = (result.stdout or "").strip()
if out.startswith("```"):
out = out.strip("`")
if out.startswith("json"):
out = out[4:].lstrip()
i = out.find("{")
j = out.rfind("}")
if i == -1 or j == -1:
if result.returncode != 0:
print(f" gemini exit {result.returncode}: {(result.stderr or '')[:200]}",
file=sys.stderr)
return None
try:
return json.loads(out[i:j+1])
except json.JSONDecodeError as e:
print(f" JSON parse failed: {e}", file=sys.stderr)
return None
# ─── draft assembly + validation ──────────────────────────────────────────
def assemble_draft(
gap: dict,
response: dict,
qid: str,
) -> dict[str, Any]:
"""Build the full YAML body from Gemini's response + gap-derived fields."""
now = datetime.now(timezone.utc).isoformat(timespec="seconds")
details_in = response.get("details") or {}
return {
"schema_version": "1.0",
"id": qid,
"track": gap["track"],
"level": gap["missing_level"],
"zone": response.get("zone") or "analyze",
"topic": gap["topic"],
# competency_area must come from the bridge — the gap entry doesn't
# carry it, so we inherit from the between-question. assemble_draft
# is called with this already resolved by main(); see _competency.
"competency_area": gap.get("_competency_area"),
"bloom_level": response.get("bloom_level"),
"phase": response.get("phase") or "both",
"title": response.get("title", "").strip(),
"scenario": response.get("scenario", "").strip(),
"question": response.get("question", "").strip(),
"details": {
"realistic_solution": (details_in.get("realistic_solution") or "").strip(),
"common_mistake": (details_in.get("common_mistake") or "").strip() or None,
"napkin_math": (details_in.get("napkin_math") or "").strip() or None,
},
"status": "draft",
"provenance": "llm-draft",
"requires_explanation": False,
"expected_time_minutes": int(response.get("expected_time_minutes") or 10),
"tags": response.get("tags") or None,
"_authoring": {
"origin": GEMINI_MODEL,
"tool": "generate_question_for_gap.py",
"generated_at": now,
"gap": {
"between": gap["between"],
"missing_level": gap["missing_level"],
"rationale": gap.get("rationale"),
},
},
}
def schema_validate(draft: dict[str, Any]) -> tuple[bool, str]:
"""Run the draft through Pydantic Question. Returns (ok, error_text)."""
if Question is None:
return False, "vault_cli not importable; install with `pip install -e interviews/vault-cli/`"
# Strip our private metadata; the Pydantic model will accept extra by
# config, but we don't want it to surface as a validation surprise.
body = {k: v for k, v in draft.items() if not k.startswith("_")}
# Drop None-valued optional details so Pydantic gets a clean dict.
if isinstance(body.get("details"), dict):
body["details"] = {k: v for k, v in body["details"].items() if v is not None}
try:
Question.model_validate(body)
return True, ""
except Exception as e: # pydantic ValidationError stringifies usefully
return False, str(e)
def write_draft(draft: dict[str, Any], output_dir: Path) -> Path:
track = draft["track"]
area = draft["competency_area"]
qid = draft["id"]
target_dir = output_dir / track / area
target_dir.mkdir(parents=True, exist_ok=True)
target = target_dir / f"{qid}.yaml.draft"
with target.open("w", encoding="utf-8") as f:
yaml.safe_dump(draft, f, sort_keys=False, allow_unicode=True, width=100)
return target
# ─── main ─────────────────────────────────────────────────────────────────
def resolve_competency_area(gap: dict, corpus: dict[str, dict]) -> str | None:
"""Inherit competency_area from the between-questions.
All published questions in the same (track, topic) bucket should agree on
competency_area (it's a topic-level invariant). We pick from the first
between question; if they disagree, prefer the lower-level one (since the
gap is bridging upward from it) and warn the caller.
"""
for qid in gap.get("between", []):
q = corpus.get(qid)
if q and q.get("competency_area"):
return q["competency_area"]
return None
def process_gap(
gap: dict,
corpus: dict[str, dict],
next_ids: dict[str, int],
output_dir: Path,
*,
dry_run: bool = False,
) -> dict[str, Any]:
"""Returns a one-row report describing the outcome."""
track = gap.get("track")
if not track:
    return {"qid": None, "ok": False, "why": "gap entry missing track", "gap": gap}
if track not in next_ids:
    next_ids[track] = 0
seq = next_ids[track]
qid = f"{track}-{seq:04d}"
next_ids[track] = seq + 1
between = [corpus[q] for q in gap.get("between", []) if q in corpus]
if len(between) < 1:
return {"qid": qid, "ok": False, "why": "no between-questions found in corpus",
"gap": gap}
competency = resolve_competency_area(gap, corpus)
if not competency:
return {"qid": qid, "ok": False, "why": "could not resolve competency_area",
"gap": gap}
exemplars = find_exemplars(
corpus,
track=track,
topic=gap["topic"],
target_level=gap["missing_level"],
skip_ids=set(gap.get("between", [])),
limit=3,
)
prompt = build_prompt(gap, between, exemplars)
if dry_run:
return {"qid": qid, "ok": True, "dry_run": True,
"prompt_chars": len(prompt),
"exemplars": [e["id"] for e in exemplars]}
response = call_gemini(prompt)
if response is None:
return {"qid": qid, "ok": False, "why": "no/unparsable Gemini response", "gap": gap}
gap_with_area = dict(gap)
gap_with_area["_competency_area"] = competency
draft = assemble_draft(gap_with_area, response, qid)
ok, why = schema_validate(draft)
if not ok:
return {"qid": qid, "ok": False, "why": f"schema: {why[:300]}",
"gap": gap, "draft": draft}
target = write_draft(draft, output_dir)
return {"qid": qid, "ok": True,
"path": str(target.relative_to(REPO_ROOT)),
"title": draft["title"],
"level": draft["level"],
"competency_area": draft["competency_area"]}
def select_gaps(args: argparse.Namespace) -> list[dict]:
if args.gap_index is not None:
all_gaps = json.loads(Path(args.gaps_from or DEFAULT_GAPS).read_text(encoding="utf-8"))
return [all_gaps[args.gap_index]]
gaps_path = Path(args.gaps_from or DEFAULT_GAPS)
all_gaps = json.loads(gaps_path.read_text(encoding="utf-8"))
return all_gaps[: args.limit] if args.limit else all_gaps
def main() -> int:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--gaps-from", type=Path,
help=f"path to gaps JSON (default {DEFAULT_GAPS})")
ap.add_argument("--gap-index", type=int,
help="process a single gap entry by 0-based index")
ap.add_argument("--limit", type=int, default=None,
help="process at most N gaps from the file")
ap.add_argument("--output-dir", type=Path, default=QUESTIONS_DIR,
help=f"target tree (default {QUESTIONS_DIR})")
ap.add_argument("--dry-run", action="store_true",
help="resolve gaps + build prompts, but don't call Gemini")
args = ap.parse_args()
corpus = load_corpus_index()
existing_drafts = list(args.output_dir.rglob("*.yaml.draft"))
next_ids = next_ids_per_track(corpus, existing_drafts)
print(f"corpus: {len(corpus)} questions; "
f"existing drafts: {len(existing_drafts)}")
print(f"next-id allocator: {dict(sorted(next_ids.items()))}")
gaps = select_gaps(args)
print(f"processing {len(gaps)} gap(s)")
results: list[dict[str, Any]] = []
for i, gap in enumerate(gaps):
print(f"\n[{i+1}/{len(gaps)}] {gap.get('track')}/{gap.get('topic')} "
f"L?→{gap.get('missing_level')} between={gap.get('between')}")
if i > 0 and not args.dry_run:
time.sleep(INTER_CALL_DELAY_S)
r = process_gap(gap, corpus, next_ids, args.output_dir, dry_run=args.dry_run)
results.append(r)
if r.get("ok"):
print(f"{r['qid']}: {r.get('path') or '(dry-run)'}")
else:
print(f"{r['qid']}: {r.get('why')}")
n_ok = sum(1 for r in results if r.get("ok"))
print(f"\nDONE: {n_ok}/{len(results)} draft(s) written successfully")
return 0 if n_ok > 0 or args.dry_run else 1
if __name__ == "__main__":
raise SystemExit(main())

interviews/vault-cli/scripts/validate_drafts.py

@@ -0,0 +1,492 @@
#!/usr/bin/env python3
"""Validate Gemini-authored draft questions (Phase 3.b).
For each ``*.yaml.draft`` under interviews/vault/questions/, run a
multi-gate scorecard:
1. schema — Pydantic Question model (same gate as published)
2. originality — cosine vs nearest neighbour in the same (track, topic);
reject if any neighbour exceeds the threshold (default 0.92)
3. level_fit — Gemini-judge: "does this question's cognitive load match
level=<L>?", calibrated against ≤5 existing L-level
questions in the same topic.
4. coherence — Gemini-judge: "are scenario / question /
realistic_solution mutually consistent?"
5. bridge — Gemini-judge: "does this question pedagogically chain
between <between[0]> and <between[1]> from the gap?"
A draft passes when **all** gates return "yes" (or skipped). Output:
- per-draft scorecard rows in interviews/vault/draft-validation-scorecard.json
- stdout summary: pass/fail counts + per-gate failure reasons
Use case: pilot run lands ~30 drafts in the tree; this script tells the
human reviewer which to look at first (passes) vs which to discard
(failed bridge / failed coherence).
The originality gate needs an embedding model. By default it loads
BAAI/bge-small-en-v1.5 (the same model used for the corpus's
embeddings.npz) so cosine values are directly comparable. Pass
``--no-originality`` to skip if the model load is undesirable.
The LLM-judge gates need ``gemini`` on PATH (gemini-3.1-pro-preview).
Pass ``--no-llm-judge`` to skip those gates and only run schema +
originality.
"""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import yaml
REPO_ROOT = Path(__file__).resolve().parents[3]
VAULT_DIR = REPO_ROOT / "interviews" / "vault"
QUESTIONS_DIR = VAULT_DIR / "questions"
EMBEDDINGS_PATH = VAULT_DIR / "embeddings.npz"
DEFAULT_OUTPUT = VAULT_DIR / "draft-validation-scorecard.json"
GEMINI_MODEL = "gemini-3.1-pro-preview"
ORIGINALITY_THRESHOLD = 0.92 # cosine; >= this is "too duplicative"
LEVEL_FIT_EXEMPLAR_LIMIT = 5
try:
from vault_cli.models import Question
except ImportError:
Question = None # type: ignore
# ─── corpus / drafts ──────────────────────────────────────────────────────
def load_yaml(path: Path) -> dict | None:
try:
with path.open(encoding="utf-8") as f:
d = yaml.safe_load(f)
except Exception:
return None
return d if isinstance(d, dict) else None
def load_corpus_index() -> dict[str, dict]:
out: dict[str, dict] = {}
for path in QUESTIONS_DIR.rglob("*.yaml"):
d = load_yaml(path)
if d and d.get("id"):
out[d["id"]] = d
return out
def find_drafts(scope: Path | None = None) -> list[Path]:
root = scope or QUESTIONS_DIR
return sorted(root.rglob("*.yaml.draft"))
def question_payload(q: dict[str, Any]) -> dict[str, Any]:
d = q.get("details") or {}
return {
"id": q.get("id"),
"level": q.get("level"),
"title": q.get("title"),
"scenario": q.get("scenario"),
"question": q.get("question"),
"realistic_solution": d.get("realistic_solution"),
}
# ─── Gate 1: schema ───────────────────────────────────────────────────────
def gate_schema(draft: dict[str, Any]) -> tuple[bool, str]:
if Question is None:
return False, "vault_cli not importable; pip install -e interviews/vault-cli/"
body = {k: v for k, v in draft.items() if not k.startswith("_")}
if isinstance(body.get("details"), dict):
body["details"] = {k: v for k, v in body["details"].items() if v is not None}
try:
Question.model_validate(body)
return True, ""
except Exception as e:
return False, str(e)[:300]
# ─── Gate 2: originality (cosine vs neighbours) ───────────────────────────
_embed_state: dict[str, Any] = {}
def _load_embedding_model_and_corpus():
"""Lazy: load BAAI/bge-small-en-v1.5 + corpus vectors once per run."""
if "model" in _embed_state:
return _embed_state
import numpy as np
from sentence_transformers import SentenceTransformer
if not EMBEDDINGS_PATH.exists():
raise FileNotFoundError(f"missing {EMBEDDINGS_PATH} — needed for originality gate")
npz = np.load(EMBEDDINGS_PATH, allow_pickle=True)
model_name = str(npz["model_name"])
model = SentenceTransformer(model_name)
_embed_state.update({
"model": model,
"model_name": model_name,
"vectors": npz["vectors"], # (N, dim) L2-normalised
"qids": [str(x) for x in npz["qids"]],
"qid_to_row": {str(q): i for i, q in enumerate(npz["qids"])},
})
return _embed_state
def gate_originality(
draft: dict[str, Any],
corpus: dict[str, dict],
threshold: float = ORIGINALITY_THRESHOLD,
) -> tuple[bool, str, dict[str, Any]]:
"""Return (ok, reason, detail).
detail carries the top-1 neighbour qid + cosine, useful for the human
reviewer to spot-check against.
"""
import numpy as np
state = _load_embedding_model_and_corpus()
model = state["model"]
vectors = state["vectors"]
qids = state["qids"]
qid_to_row = state["qid_to_row"]
# Embed the draft (concat title + scenario + question — what the v1
# corpus embedding script also used for its rows).
text = "\n".join([
draft.get("title", "") or "",
draft.get("scenario", "") or "",
draft.get("question", "") or "",
])
vec = model.encode([text], normalize_embeddings=True)[0]
# Restrict comparisons to the same (track, topic) bucket — that's
# where duplicates would actually matter.
track = draft.get("track")
topic = draft.get("topic")
bucket_qids = [
qid for qid, q in corpus.items()
if q.get("track") == track and q.get("topic") == topic
and qid in qid_to_row
]
if not bucket_qids:
return True, "", {"note": "no in-bucket corpus neighbours; skipping"}
rows = np.array([qid_to_row[q] for q in bucket_qids], dtype=np.int64)
# cosine = dot product since both sides are L2-normalised
sims = vectors[rows] @ vec # (len(rows),)
top = int(np.argmax(sims))
top_qid = bucket_qids[top]
top_cos = float(sims[top])
detail = {"top_neighbour": top_qid, "cosine": round(top_cos, 4),
"threshold": threshold, "bucket_size": len(bucket_qids)}
if top_cos >= threshold:
return False, f"too similar to {top_qid} (cosine={top_cos:.3f} >= {threshold})", detail
return True, "", detail
# ─── Gate 3-5: Gemini judges ──────────────────────────────────────────────
def call_gemini_judge(prompt: str, timeout: int = 240) -> dict | None:
"""Single judge call; expects strict-JSON {"verdict": "yes|no", "rationale": "..."}."""
try:
result = subprocess.run(
["gemini", "-m", GEMINI_MODEL, "-p", prompt, "--yolo"],
capture_output=True, text=True, timeout=timeout,
)
except subprocess.TimeoutExpired:
return None
out = (result.stdout or "").strip()
if out.startswith("```"):
out = out.strip("`")
if out.startswith("json"):
out = out[4:].lstrip()
i = out.find("{")
j = out.rfind("}")
if i == -1 or j == -1:
return None
try:
return json.loads(out[i:j+1])
except json.JSONDecodeError:
return None
def _judge_block(draft: dict[str, Any]) -> str:
return json.dumps(question_payload(draft), indent=2)
def gate_level_fit(draft: dict, corpus: dict[str, dict]) -> tuple[bool, str, dict]:
target_level = draft.get("level")
track = draft.get("track")
topic = draft.get("topic")
exemplars = sorted(
[q for q in corpus.values()
if q.get("track") == track and q.get("topic") == topic
and q.get("level") == target_level
and q.get("status") == "published"],
key=lambda q: q.get("id", ""),
)[:LEVEL_FIT_EXEMPLAR_LIMIT]
if not exemplars:
return True, "", {"note": f"no published L={target_level} exemplars in bucket; skipping"}
prompt = f"""You are calibrating cognitive load. Given an EXAMPLE PAIR of
existing published interview questions at level={target_level} for
track={track}, topic={topic}, judge whether the CANDIDATE question
matches that level's typical cognitive demand.
Bloom mapping: L1=remember, L2=understand, L3=apply, L4=analyze,
L5=evaluate, L6+=create.
EXEMPLARS at level={target_level}:
{json.dumps([question_payload(q) for q in exemplars], indent=2)}
CANDIDATE:
{_judge_block(draft)}
Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
resp = call_gemini_judge(prompt)
if resp is None:
return False, "no judge response", {}
verdict = (resp.get("verdict") or "").strip().lower()
if verdict == "yes":
return True, "", {"rationale": resp.get("rationale", "")}
return False, f"level_fit=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}
def gate_coherence(draft: dict) -> tuple[bool, str, dict]:
prompt = f"""Judge whether the scenario, question, and realistic_solution
are MUTUALLY CONSISTENT. Specifically:
- Does the question logically follow from the scenario?
- Does the realistic_solution actually answer the question (not adjacent)?
- Are the numbers / system parameters internally consistent across all
three fields (no contradictions)?
CANDIDATE:
{_judge_block(draft)}
Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
resp = call_gemini_judge(prompt)
if resp is None:
return False, "no judge response", {}
verdict = (resp.get("verdict") or "").strip().lower()
if verdict == "yes":
return True, "", {"rationale": resp.get("rationale", "")}
return False, f"coherence=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}
def gate_bridge(draft: dict, corpus: dict[str, dict]) -> tuple[bool, str, dict]:
auth = draft.get("_authoring") or {}
gap = auth.get("gap") or {}
between_ids = gap.get("between") or []
between = [corpus.get(q) for q in between_ids if corpus.get(q)]
if len(between) < 2:
# Without two between-questions we can't judge a bridge meaningfully.
return True, "", {"note": "fewer than 2 between-questions in corpus; skipping"}
prompt = f"""Judge whether the CANDIDATE question pedagogically chains
between the two BETWEEN-questions. Specifically:
- Is the candidate's cognitive load above between[0]'s level and at or
below between[1]'s level (Bloom progression direction)?
- Does the candidate share scenario/concept thread with the between-
questions (not introducing a new system)?
- Would inserting the candidate between the two existing questions
produce a coherent +1 (or +2 last-resort) progression chain?
BETWEEN[0] (lower):
{json.dumps(question_payload(between[0]), indent=2)}
BETWEEN[1] (higher):
{json.dumps(question_payload(between[1]), indent=2)}
CANDIDATE:
{_judge_block(draft)}
Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
resp = call_gemini_judge(prompt)
if resp is None:
return False, "no judge response", {}
verdict = (resp.get("verdict") or "").strip().lower()
if verdict == "yes":
return True, "", {"rationale": resp.get("rationale", "")}
return False, f"bridge=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}
# ─── runner ───────────────────────────────────────────────────────────────
def evaluate_draft(
draft_path: Path,
corpus: dict[str, dict],
args: argparse.Namespace,
) -> dict[str, Any]:
draft = load_yaml(draft_path)
if not draft:
return {"path": str(draft_path), "verdict": "fail",
"errors": ["could not load YAML"]}
try:
rel_path = str(draft_path.relative_to(REPO_ROOT))
except ValueError:
rel_path = str(draft_path)
rec: dict[str, Any] = {
"path": rel_path,
"draft_id": draft.get("id"),
"track": draft.get("track"),
"topic": draft.get("topic"),
"level": draft.get("level"),
}
# Gate 1 — schema (mandatory)
ok, why = gate_schema(draft)
rec["schema_ok"] = ok
if not ok:
rec["schema_error"] = why
rec["verdict"] = "fail"
return rec # downstream gates assume a structurally valid YAML
# Gate 2 — originality
if args.no_originality:
rec["originality"] = "skipped"
else:
try:
ok, why, detail = gate_originality(draft, corpus, threshold=args.threshold)
rec["originality"] = "pass" if ok else "fail"
rec["originality_detail"] = detail
if not ok:
rec["originality_reason"] = why
except Exception as e:
rec["originality"] = "error"
rec["originality_reason"] = str(e)[:200]
# Gates 3-5 — Gemini judges
if args.no_llm_judge:
rec["level_fit"] = "skipped"
rec["coherence"] = "skipped"
rec["bridge"] = "skipped"
else:
for name, gate in [("level_fit", gate_level_fit),
("coherence", gate_coherence),
("bridge", gate_bridge)]:
try:
if name == "coherence":
ok, why, detail = gate(draft)
else:
ok, why, detail = gate(draft, corpus)
except Exception as e:
rec[name] = "error"
rec[f"{name}_reason"] = str(e)[:200]
continue
rec[name] = "pass" if ok else "fail"
rec[f"{name}_detail"] = detail
if not ok:
rec[f"{name}_reason"] = why
time.sleep(args.judge_delay) # be polite between calls
# Final verdict: pass iff every non-skipped gate is pass.
gate_results = [
rec.get("originality"),
rec.get("level_fit"),
rec.get("coherence"),
rec.get("bridge"),
]
has_fail = any(r == "fail" for r in gate_results)
has_error = any(r == "error" for r in gate_results)
rec["verdict"] = "fail" if has_fail else ("error" if has_error else "pass")
return rec
def main() -> int:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--scope", type=Path, default=None,
help=f"directory tree to scan for *.yaml.draft "
f"(default {QUESTIONS_DIR})")
ap.add_argument("--output", type=Path, default=DEFAULT_OUTPUT,
help=f"scorecard JSON (default {DEFAULT_OUTPUT})")
ap.add_argument("--no-originality", action="store_true",
help="skip the embedding-based originality gate")
ap.add_argument("--no-llm-judge", action="store_true",
help="skip the Gemini-judge gates (level_fit, coherence, bridge)")
ap.add_argument("--threshold", type=float, default=ORIGINALITY_THRESHOLD,
help=f"originality cosine cutoff (default {ORIGINALITY_THRESHOLD})")
ap.add_argument("--judge-delay", type=float, default=4.0,
help="seconds between Gemini judge calls (default 4.0)")
ap.add_argument("--limit", type=int, default=None,
help="evaluate only the first N drafts")
args = ap.parse_args()
drafts = find_drafts(args.scope)
if args.limit:
drafts = drafts[: args.limit]
if not drafts:
print(f"no *.yaml.draft files found under {args.scope or QUESTIONS_DIR}")
return 0
corpus = load_corpus_index()
print(f"corpus: {len(corpus)} published+draft questions; "
f"drafts to evaluate: {len(drafts)}")
rows: list[dict[str, Any]] = []
for i, p in enumerate(drafts, start=1):
try:
display = p.relative_to(REPO_ROOT)
except ValueError:
display = p
print(f"\n[{i}/{len(drafts)}] {display}")
rec = evaluate_draft(p, corpus, args)
gate_summary = ", ".join(
f"{g}={rec.get(g, '-')}"
for g in ("originality", "level_fit", "coherence", "bridge")
)
print(f" verdict={rec.get('verdict'):4s} {gate_summary}")
if rec.get("verdict") == "fail":
for k in ("schema_error", "originality_reason",
"level_fit_reason", "coherence_reason", "bridge_reason"):
if k in rec:
print(f" {k}: {str(rec[k])[:200]}")
rows.append(rec)
try:
out_display = args.output.relative_to(REPO_ROOT)
except ValueError:
out_display = args.output
args.output.parent.mkdir(parents=True, exist_ok=True)
args.output.write_text(json.dumps({
"generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
"originality_threshold": args.threshold,
"drafts_evaluated": len(rows),
"passes": sum(1 for r in rows if r.get("verdict") == "pass"),
"fails": sum(1 for r in rows if r.get("verdict") == "fail"),
"errors": sum(1 for r in rows if r.get("verdict") == "error"),
"rows": rows,
}, indent=2) + "\n")
print(f"\nwrote {out_display}")
n_pass = sum(1 for r in rows if r.get("verdict") == "pass")
n_fail = sum(1 for r in rows if r.get("verdict") == "fail")
n_err = sum(1 for r in rows if r.get("verdict") == "error")
print(f"summary: pass={n_pass} fail={n_fail} error={n_err}")
return 0
if __name__ == "__main__":
raise SystemExit(main())