Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-05-07 02:03:55 -05:00)
feat(vault-cli): Phase 3.a + 3.b — gap-driven authoring tooling
Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.
generate_question_for_gap.py (3.a):
- Reads a gap entry, loads between-questions + same-bucket exemplars,
prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
drafts out of vault check / vault build until promotion.
- ID allocator scans corpus + existing drafts so a batch run gets
distinct fresh IDs without touching id-registry.yaml.
- Modes: --gap-index, --gaps-from + --limit, --dry-run.
validate_drafts.py (3.b):
- Five gates per draft: schema (Pydantic), originality (cosine vs
in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus
embeddings.npz so values are comparable; cutoff 0.92), level_fit
(Gemini-judge against same-level exemplars), coherence
(Gemini-judge: scenario/question/solution consistency), and bridge
(Gemini-judge: chain-fit between the gap's two anchors).
- Final verdict: pass iff every non-skipped gate passes.
- Skips: --no-originality, --no-llm-judge.
- Output: interviews/vault/draft-validation-scorecard.json.
Smoke checks:
- 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
cloud-4579. Synthetic Gemini response Pydantic-validates clean.
- 3.b on a synthetic /tmp draft: schema + originality pass (top
neighbour cosine 0.73 vs 0.92 threshold).
Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.
CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -3,7 +3,7 @@
**Status:** active workstream
**Branch:** `yaml-audit` (off `dev`)
**Worktree:** `/Users/VJ/GitHub/MLSysBook-yaml-audit`
-**Last updated:** 2026-05-01 (Phase 1 + 2 + 4.2/4.8 shipped — 879 chains, tier in UI, docs current)
+**Last updated:** 2026-05-01 (Phase 3.a + 3.b tooling shipped; 3.c pilot deferred for review)

This document is the canonical resumable plan for the vault chain rebuild
+ corpus growth work. **Future Claude sessions: read the "Resume Here"
@@ -368,7 +368,7 @@ primary chains in default surfaces, exposes secondary in "more paths."

## Phase 3 — Gap-driven question authoring

-**Status:** `not started`
+**Status:** `tooling complete (3.a + 3.b); pilot 3.c deferred for review`
**Goal:** Use the 138+ entries in `gaps.proposed.json` to author new
questions filling missing rungs, validated independently before commit.
This is the durable corpus growth strategy.
@@ -966,5 +966,103 @@ available to review the first few generated drafts.

---

### 2026-05-01 — Phase 3.a + 3.b: authoring + validation tooling

**What was done:**

**Phase 3.a — `generate_question_for_gap.py`:**
- Reads a gap entry (`{track, topic, missing_level, between, rationale}`)
  from gaps.proposed.json (or .lenient.json), loads the between-questions
  in full + up to 3 same-bucket exemplars at the target level, prompts
  Gemini-3.1-pro-preview with the schema summary + bridge context, and
  writes a candidate question to
  `interviews/vault/questions/<track>/<area>/<id>.yaml.draft`.
- ID allocator scans the existing corpus + already-written drafts so a
  batch run gets distinct fresh IDs without touching `id-registry.yaml`
  (registry append happens at promotion time, not generation).
- Authoring metadata stamped under a private `_authoring` block:
  origin model, tool name, timestamp, and the source gap entry. The
  Pydantic Question model has `extra="allow"`, so this passes schema.
- Modes: `--gap-index <N>` (single gap), `--gaps-from <path> --limit N`
  (batch), `--dry-run` (build prompts without calling Gemini).
- Smoke checks:
  - `--dry-run --gap-index 0` resolves the first gap, finds 3 exemplars,
    builds the prompt, allocates `cloud-4579`. ✓
  - Synthetic Gemini response → `assemble_draft` → `Question.model_validate`
    passes; YAML preview looks right (12-field body, sensible details). ✓

**Phase 3.b — `validate_drafts.py`:**
- Five-gate scorecard per draft:
  1. **schema** — Pydantic Question (mandatory; downstream gates skip
     on schema fail to avoid spurious LLM calls)
  2. **originality** — embeds `title + scenario + question` with
     `BAAI/bge-small-en-v1.5` (matches the corpus embeddings.npz model
     so cosines are directly comparable), compares against in-bucket
     neighbors, flags any `cosine ≥ 0.92`
  3. **level_fit** — Gemini-judge against ≤5 published exemplars at the
     target level in the same (track, topic)
  4. **coherence** — Gemini-judge: scenario / question /
     realistic_solution mutually consistent
  5. **bridge** — Gemini-judge: candidate genuinely chains between the
     two `between` questions named in `_authoring.gap`
- Skips: `--no-originality` (skip embed model load),
  `--no-llm-judge` (skip Gemini gates). Schema gate is unconditional.
- Output: `interviews/vault/draft-validation-scorecard.json` with per-row
  detail + final verdict (`pass | fail | error`); a reviewer-triage
  sketch follows this entry.
- Smoke check: synthetic draft in /tmp passed schema + originality
  (top-neighbor cosine 0.73 vs 0.92 threshold). End-to-end runner
  produced a well-formed scorecard. ✓

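As a convenience for the reviewer, here is a minimal triage sketch (not part of the
committed tooling) that lists the drafts worth opening first; the `rows`, `verdict`,
`draft_id`, and `path` fields are the ones `validate_drafts.py` writes into the
scorecard:

```python
#!/usr/bin/env python3
"""List the drafts that passed every gate in the 3.b scorecard."""
import json
from pathlib import Path

SCORECARD = Path("interviews/vault/draft-validation-scorecard.json")

data = json.loads(SCORECARD.read_text(encoding="utf-8"))
passing = [r for r in data["rows"] if r.get("verdict") == "pass"]

print(f"{len(passing)}/{data['drafts_evaluated']} drafts passed all gates")
for r in passing:
    # every row carries the draft id and the repo-relative path of the .yaml.draft
    print(f"  {r['draft_id']}  {r['path']}")
```
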
**What was deliberately not done tonight:**
- **Phase 3.c (pilot run on 30 highest-value gaps):** This generates
  new YAML question content that needs human review *before* promotion.
  Running 30 unsupervised generations and 30×4 LLM-judge calls without
  the user available to spot-check the first few outputs is the wrong
  shape of work for an overnight slot. The tooling is ready when the
  user is.
- **Phase 3.d–3.f:** Promotion + re-chain are downstream of 3.c
  acceptance.

**Recommended pilot when the user is back:**
1. Pick 30 gaps from `gaps.proposed.lenient.json` where the bucket has
   ≥4 questions already (just missing the bridge):
   ```bash
   python3 interviews/vault-cli/scripts/generate_question_for_gap.py \
       --gaps-from interviews/vault/gaps.proposed.lenient.json \
       --limit 30
   ```
2. Validate:
   ```bash
   python3 interviews/vault-cli/scripts/validate_drafts.py
   ```
3. Manually review the passing drafts (~20-25 expected).
4. Promote: rename `.yaml.draft` → `.yaml`, append to id-registry (a rough
   promotion sketch follows this list).
5. Re-run `build_chains_with_gemini.py --all` so the new questions get
   absorbed into chains.

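Since promotion (3.d) is not scripted yet, here is a rough, hypothetical sketch of
what step 4 could look like. It only renames drafts that are explicitly listed on the
command line and prints the IDs for the manual id-registry append; the registry's
exact format is deliberately not assumed here:

```python
#!/usr/bin/env python3
"""Hypothetical promotion sketch for step 4 (there is no committed promote script yet)."""
import sys
from pathlib import Path

import yaml

promoted_ids = []
for arg in sys.argv[1:]:
    draft = Path(arg)  # e.g. interviews/vault/questions/<track>/<area>/<id>.yaml.draft
    if draft.suffixes[-2:] != [".yaml", ".draft"]:
        raise SystemExit(f"not a .yaml.draft file: {draft}")
    body = yaml.safe_load(draft.read_text(encoding="utf-8"))
    # Dropping the trailing ".draft" leaves the promoted "<id>.yaml" path.
    # Any status/provenance flips required at promotion are left to the reviewer.
    draft.rename(draft.with_suffix(""))
    promoted_ids.append(body["id"])

print("Append these IDs to interviews/vault/id-registry.yaml (append-only, CI-enforced):")
for qid in promoted_ids:
    print(f"  - {qid}")
```

Once the pilot pattern settles, this is roughly the shape a `vault promote` helper
could take.
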
**Files committed:**
- `interviews/vault-cli/scripts/generate_question_for_gap.py` (new)
- `interviews/vault-cli/scripts/validate_drafts.py` (new)
- `interviews/vault-cli/docs/CHAIN_ROADMAP.md` (this Progress Log entry +
  status flips)

**Notes for next session:**
- Both scripts assume `gemini` CLI on PATH (gemini-3.1-pro-preview) and,
  for originality, the corpus's `embeddings.npz` (gitignored, regenerable
  by the existing embedding scripts). `validate_drafts --no-llm-judge`
  is a fast first cut that only exercises schema + originality if you
  want to triage drafts before paying for the LLM-judge calls.
- Heads up: each draft in 3.b consumes ~3 Gemini calls (level_fit +
  coherence + bridge). 30 drafts → ~90 calls. Daily cap is 250.
- `id-registry.yaml` is append-only and CI-enforced. Promotion (3.d)
  needs to add new IDs to it; that's not yet wired into a script —
  manual append for the pilot, then we can extract a `vault promote`
  helper from the pattern.

**Next step:** Phase 3.c — pilot run on 30 high-value gaps (best done
with the user available to spot-check the first few outputs).

---

<!-- Append new entries above this comment; reverse-chronological order is fine,
but keep entries dated and self-contained for resume context. -->

interviews/vault-cli/scripts/generate_question_for_gap.py (new executable file, 493 lines)
@@ -0,0 +1,493 @@
#!/usr/bin/env python3
"""Author a candidate question to fill a chain gap (Phase 3.a).

Reads a gap entry (from gaps.proposed.json / gaps.proposed.lenient.json)
that names two existing questions and a missing Bloom level between
them, then prompts Gemini-3.1-pro-preview to draft a bridging question
that fits the (track, topic, target-level) slot.

Inputs per gap entry:
    {
      "track": "edge",
      "topic": "memory-mapped-inference",
      "missing_level": "L3",
      "between": ["edge-0220", "edge-0224"],
      "rationale": "..."
    }

Outputs per accepted draft:
    interviews/vault/questions/<track>/<area>/<auto-id>.yaml.draft
    — full question YAML with stamped authoring metadata. The .draft
    suffix is intentional: vault check / vault build only load *.yaml,
    so drafts ride along in the tree without affecting the release set
    until they are promoted (renamed to .yaml) by a follow-up step.

Usage:
    python3 generate_question_for_gap.py --gap-index 0
    python3 generate_question_for_gap.py --gaps-from interviews/vault/gaps.proposed.json --limit 5
    python3 generate_question_for_gap.py --gaps-from <path> --limit 30 --output-dir <dir>

This is the Phase 3.a tool. Validation (originality / level-fit /
coherence / bridge) is a separate concern handled by validate_drafts.py.
The only validation done here is structural Pydantic-schema acceptance,
which is the gate that prevents writing a malformed YAML to disk.
"""

from __future__ import annotations

import argparse
import json
import re
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import yaml

REPO_ROOT = Path(__file__).resolve().parents[3]
VAULT_DIR = REPO_ROOT / "interviews" / "vault"
QUESTIONS_DIR = VAULT_DIR / "questions"
ID_REGISTRY = VAULT_DIR / "id-registry.yaml"
DEFAULT_GAPS = VAULT_DIR / "gaps.proposed.json"

GEMINI_MODEL = "gemini-3.1-pro-preview"
INTER_CALL_DELAY_S = 6  # be polite to the Gemini CLI's rate limiter

# Imported lazily so the file is still readable as a script even if the
# vault_cli package isn't editable-installed in the current interpreter.
try:
    from vault_cli.models import Question
except ImportError:  # pragma: no cover
    Question = None  # type: ignore


# ─── corpus + registry helpers ────────────────────────────────────────────


def load_corpus_index() -> dict[str, dict]:
    """qid → full YAML dict for every published question.

    We need full bodies (scenario + details) for the between-questions and
    exemplars; the corpus.json summary doesn't carry them.
    """
    out: dict[str, dict] = {}
    for path in QUESTIONS_DIR.rglob("*.yaml"):
        try:
            with path.open(encoding="utf-8") as f:
                d = yaml.safe_load(f)
        except Exception:
            continue
        if isinstance(d, dict) and d.get("id"):
            out[d["id"]] = d
    return out


def next_ids_per_track(corpus: dict[str, dict], existing_drafts: list[Path]) -> dict[str, int]:
    """Return per-track next-available numeric suffix.

    Considers BOTH committed YAMLs in the corpus AND any .yaml.draft files
    written in earlier runs of this script — so a batch generating 30 drafts
    gets 30 distinct IDs even before any of them is promoted into the
    id-registry.
    """
    max_for_track: dict[str, int] = {}
    pat = re.compile(r"^([a-z]+)-(\d+)$")
    for qid in corpus:
        m = pat.match(qid)
        if not m:
            continue
        track, num = m.group(1), int(m.group(2))
        if num > max_for_track.get(track, -1):
            max_for_track[track] = num
    for draft in existing_drafts:
        # filename like edge-2545.yaml.draft
        stem = draft.name.split(".")[0]
        m = pat.match(stem)
        if m:
            track, num = m.group(1), int(m.group(2))
            if num > max_for_track.get(track, -1):
                max_for_track[track] = num
    return {t: n + 1 for t, n in max_for_track.items()}

# ─── prompt construction ──────────────────────────────────────────────────


SCHEMA_SUMMARY = """SCHEMA SUMMARY (Pydantic Question, v1.0):
REQUIRED FIELDS:
  schema_version: "1.0"
  id: "<track>-<NNNN>"  # provided externally, do NOT invent
  track: one of [cloud, edge, mobile, tinyml, global]
  level: one of [L1, L2, L3, L4, L5, L6+]
  zone: one of [analyze, design, diagnosis, evaluation, fluency,
                implement, mastery, optimization, realization,
                recall, specification]
  topic: closed enum (87 topics; use the one in the gap input)
  competency_area: one of [architecture, compute, cross-cutting, data,
                           deployment, latency, memory, networking,
                           optimization, parallelism, power, precision,
                           reliability]
  bloom_level: one of [remember, understand, apply, analyze,
                       evaluate, create]  # informs cognitive demand
  title: ≤ 120 chars, descriptive, no trailing period
  scenario: 1-3 sentences setting up a concrete situation
  question: the explicit interrogative the candidate must answer
  details.realistic_solution: 1-3 sentence high-quality answer
  details.common_mistake: "**The Pitfall:** ...\\n**The Rationale:** ...\\n**The Consequence:** ..."
  details.napkin_math: OPTIONAL but recommended for L3+
  status: MUST be "draft" (this is a candidate for review)
  provenance: MUST be "llm-draft"
  requires_explanation: false (default)
  expected_time_minutes: integer, ≥ 0 (typical: 5-15)

LEVEL ↔ BLOOM ROUGH MAPPING:
  L1 → remember   L2 → understand   L3 → apply / analyze
  L4 → analyze    L5 → evaluate     L6+ → create

STRICT JSON OUTPUT FORMAT (no prose, no fences, no extra fields):
{
  "title": "<title>",
  "scenario": "<scenario>",
  "question": "<question>",
  "zone": "<zone>",
  "bloom_level": "<bloom>",
  "phase": "training | inference | both",
  "expected_time_minutes": <int>,
  "tags": ["<tag>", ...],
  "details": {
    "realistic_solution": "<1-3 sentence answer>",
    "common_mistake": "**The Pitfall:** ...\\n**The Rationale:** ...\\n**The Consequence:** ...",
    "napkin_math": "**Assumptions & Constraints:** ...\\n\\n**Calculations:** ...\\n\\n**Conclusion:** ..."
  }
}
"""


def question_payload(q: dict[str, Any]) -> dict[str, Any]:
    """Compact view of an existing question to feed Gemini as context."""
    d = q.get("details") or {}
    return {
        "id": q.get("id"),
        "level": q.get("level"),
        "zone": q.get("zone"),
        "bloom_level": q.get("bloom_level"),
        "title": q.get("title"),
        "scenario": q.get("scenario"),
        "question": q.get("question"),
        "realistic_solution": d.get("realistic_solution"),
    }


def find_exemplars(
    corpus: dict[str, dict],
    track: str,
    topic: str,
    target_level: str,
    skip_ids: set[str],
    limit: int = 3,
) -> list[dict]:
    """Pick up to `limit` published questions in the same (track, topic) at
    the target level. Used as style-and-cognitive-load exemplars for the
    drafted question.
    """
    pool = [
        q for q in corpus.values()
        if q.get("track") == track
        and q.get("topic") == topic
        and q.get("level") == target_level
        and q.get("status") == "published"
        and q.get("id") not in skip_ids
    ]
    pool.sort(key=lambda q: q.get("id", ""))
    return pool[:limit]


def build_prompt(gap: dict, between: list[dict], exemplars: list[dict]) -> str:
    parts = [
        "You are an ML systems interview question author. Draft ONE candidate",
        "question that fills the missing rung in a pedagogical chain.",
        "",
        SCHEMA_SUMMARY,
        "",
        f"GAP TO FILL:",
        f"  track: {gap['track']}",
        f"  topic: {gap['topic']}",
        f"  target level: {gap['missing_level']}",
        f"  bridge between: {gap['between']}",
        f"  rationale: {gap.get('rationale', '')}",
        "",
        "BETWEEN-QUESTIONS (these MUST flank the new question pedagogically):",
        json.dumps([question_payload(q) for q in between], indent=2),
        "",
        "EXEMPLARS at the target level in the same (track, topic) — match",
        "their voice and cognitive load (NOT their content):",
        json.dumps([question_payload(q) for q in exemplars], indent=2) if exemplars
        else "  (no in-bucket exemplars at this level — use the between-questions' style)",
        "",
        "AUTHORING RULES:",
        "  - The new question MUST chain naturally between the two between-questions:",
        "    Q[lower].level < new.level < Q[higher].level (or equal-level edges where",
        "    one between-question is exactly at target_level — re-read the gap).",
        "  - Same scenario/concept thread as the bridge — do NOT introduce a",
        "    new system topic.",
        "  - Cognitive load matches target Bloom: e.g. L3 (apply) asks the",
        "    candidate to perform a calculation; L4 (analyze) asks for",
        "    decomposition or root-cause; L5 (evaluate) asks for a",
        "    trade-off judgment with quantitative basis.",
        "  - realistic_solution is a high-quality, concise answer — NOT a",
        "    rubric. common_mistake follows the **Pitfall / Rationale /",
        "    Consequence** format. napkin_math has the **Assumptions /",
        "    Calculations / Conclusion** format.",
        "  - Avoid duplicating any title or scenario in the between or",
        "    exemplar inputs.",
        "  - Output ONLY the JSON object specified in the schema summary.",
    ]
    return "\n".join(parts)

# ─── Gemini call ──────────────────────────────────────────────────────────


def call_gemini(prompt: str, model: str = GEMINI_MODEL, timeout: int = 600) -> dict | None:
    try:
        result = subprocess.run(
            ["gemini", "-m", model, "-p", prompt, "--yolo"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None

    out = (result.stdout or "").strip()
    if out.startswith("```"):
        out = out.strip("`")
        if out.startswith("json"):
            out = out[4:].lstrip()
    i = out.find("{")
    j = out.rfind("}")
    if i == -1 or j == -1:
        if result.returncode != 0:
            print(f"    gemini exit {result.returncode}: {(result.stderr or '')[:200]}",
                  file=sys.stderr)
        return None
    try:
        return json.loads(out[i:j+1])
    except json.JSONDecodeError as e:
        print(f"    JSON parse failed: {e}", file=sys.stderr)
        return None


# ─── draft assembly + validation ──────────────────────────────────────────


def assemble_draft(
    gap: dict,
    response: dict,
    qid: str,
) -> dict[str, Any]:
    """Build the full YAML body from Gemini's response + gap-derived fields."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    details_in = response.get("details") or {}
    return {
        "schema_version": "1.0",
        "id": qid,
        "track": gap["track"],
        "level": gap["missing_level"],
        "zone": response.get("zone") or "analyze",
        "topic": gap["topic"],
        # competency_area must come from the bridge — the gap entry doesn't
        # carry it, so we inherit it from the between-questions. process_gap()
        # resolves it before calling us and passes it as gap["_competency_area"].
        "competency_area": gap.get("_competency_area"),
        "bloom_level": response.get("bloom_level"),
        "phase": response.get("phase") or "both",
        "title": response.get("title", "").strip(),
        "scenario": response.get("scenario", "").strip(),
        "question": response.get("question", "").strip(),
        "details": {
            "realistic_solution": (details_in.get("realistic_solution") or "").strip(),
            "common_mistake": (details_in.get("common_mistake") or "").strip() or None,
            "napkin_math": (details_in.get("napkin_math") or "").strip() or None,
        },
        "status": "draft",
        "provenance": "llm-draft",
        "requires_explanation": False,
        "expected_time_minutes": int(response.get("expected_time_minutes") or 10),
        "tags": response.get("tags") or None,
        "_authoring": {
            "origin": GEMINI_MODEL,
            "tool": "generate_question_for_gap.py",
            "generated_at": now,
            "gap": {
                "between": gap["between"],
                "missing_level": gap["missing_level"],
                "rationale": gap.get("rationale"),
            },
        },
    }


def schema_validate(draft: dict[str, Any]) -> tuple[bool, str]:
    """Run the draft through Pydantic Question. Returns (ok, error_text)."""
    if Question is None:
        return False, "vault_cli not importable; install with `pip install -e interviews/vault-cli/`"
    # Strip our private metadata; the Pydantic model will accept extra by
    # config, but we don't want it to surface as a validation surprise.
    body = {k: v for k, v in draft.items() if not k.startswith("_")}
    # Drop None-valued optional details so Pydantic gets a clean dict.
    if isinstance(body.get("details"), dict):
        body["details"] = {k: v for k, v in body["details"].items() if v is not None}
    try:
        Question.model_validate(body)
        return True, ""
    except Exception as e:  # pydantic ValidationError stringifies usefully
        return False, str(e)


def write_draft(draft: dict[str, Any], output_dir: Path) -> Path:
    track = draft["track"]
    area = draft["competency_area"]
    qid = draft["id"]
    target_dir = output_dir / track / area
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{qid}.yaml.draft"
    with target.open("w", encoding="utf-8") as f:
        yaml.safe_dump(draft, f, sort_keys=False, allow_unicode=True, width=100)
    return target

# ─── main ─────────────────────────────────────────────────────────────────


def resolve_competency_area(gap: dict, corpus: dict[str, dict]) -> str | None:
    """Inherit competency_area from the between-questions.

    All published questions in the same (track, topic) bucket should agree on
    competency_area (it's a topic-level invariant), so we take it from the
    first between-question that carries one.
    """
    for qid in gap.get("between", []):
        q = corpus.get(qid)
        if q and q.get("competency_area"):
            return q["competency_area"]
    return None


def process_gap(
    gap: dict,
    corpus: dict[str, dict],
    next_ids: dict[str, int],
    output_dir: Path,
    *,
    dry_run: bool = False,
) -> dict[str, Any]:
    """Returns a one-row report describing the outcome."""
    track = gap.get("track")
    if not track:
        return {"qid": None, "ok": False, "why": "gap entry has no track", "gap": gap}
    if track not in next_ids:
        next_ids[track] = 0
    seq = next_ids[track]
    qid = f"{track}-{seq:04d}"
    # The sequence number is consumed even if this gap fails below; failed gaps
    # leave ID holes, which is harmless since the registry is only appended at
    # promotion time.
    next_ids[track] = seq + 1

    between = [corpus[q] for q in gap.get("between", []) if q in corpus]
    if len(between) < 1:
        return {"qid": qid, "ok": False, "why": "no between-questions found in corpus",
                "gap": gap}

    competency = resolve_competency_area(gap, corpus)
    if not competency:
        return {"qid": qid, "ok": False, "why": "could not resolve competency_area",
                "gap": gap}

    exemplars = find_exemplars(
        corpus,
        track=track,
        topic=gap["topic"],
        target_level=gap["missing_level"],
        skip_ids=set(gap.get("between", [])),
        limit=3,
    )

    prompt = build_prompt(gap, between, exemplars)
    if dry_run:
        return {"qid": qid, "ok": True, "dry_run": True,
                "prompt_chars": len(prompt),
                "exemplars": [e["id"] for e in exemplars]}

    response = call_gemini(prompt)
    if response is None:
        return {"qid": qid, "ok": False, "why": "no/unparsable Gemini response", "gap": gap}

    gap_with_area = dict(gap)
    gap_with_area["_competency_area"] = competency
    draft = assemble_draft(gap_with_area, response, qid)

    ok, why = schema_validate(draft)
    if not ok:
        return {"qid": qid, "ok": False, "why": f"schema: {why[:300]}",
                "gap": gap, "draft": draft}

    target = write_draft(draft, output_dir)
    return {"qid": qid, "ok": True,
            "path": str(target.relative_to(REPO_ROOT)),
            "title": draft["title"],
            "level": draft["level"],
            "competency_area": draft["competency_area"]}


def select_gaps(args: argparse.Namespace) -> list[dict]:
    if args.gap_index is not None:
        all_gaps = json.loads(Path(args.gaps_from or DEFAULT_GAPS).read_text(encoding="utf-8"))
        return [all_gaps[args.gap_index]]
    gaps_path = Path(args.gaps_from or DEFAULT_GAPS)
    all_gaps = json.loads(gaps_path.read_text(encoding="utf-8"))
    return all_gaps[: args.limit] if args.limit else all_gaps


def main() -> int:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--gaps-from", type=Path,
                    help=f"path to gaps JSON (default {DEFAULT_GAPS})")
    ap.add_argument("--gap-index", type=int,
                    help="process a single gap entry by 0-based index")
    ap.add_argument("--limit", type=int, default=None,
                    help="process at most N gaps from the file")
    ap.add_argument("--output-dir", type=Path, default=QUESTIONS_DIR,
                    help=f"target tree (default {QUESTIONS_DIR})")
    ap.add_argument("--dry-run", action="store_true",
                    help="resolve gaps + build prompts, but don't call Gemini")
    args = ap.parse_args()

    corpus = load_corpus_index()
    existing_drafts = list(args.output_dir.rglob("*.yaml.draft"))
    next_ids = next_ids_per_track(corpus, existing_drafts)
    print(f"corpus: {len(corpus)} questions; "
          f"existing drafts: {len(existing_drafts)}")
    print(f"next-id allocator: {dict(sorted(next_ids.items()))}")

    gaps = select_gaps(args)
    print(f"processing {len(gaps)} gap(s)")

    results: list[dict[str, Any]] = []
    for i, gap in enumerate(gaps):
        print(f"\n[{i+1}/{len(gaps)}] {gap.get('track')}/{gap.get('topic')} "
              f"L?→{gap.get('missing_level')} between={gap.get('between')}")
        if i > 0 and not args.dry_run:
            time.sleep(INTER_CALL_DELAY_S)
        r = process_gap(gap, corpus, next_ids, args.output_dir, dry_run=args.dry_run)
        results.append(r)
        if r.get("ok"):
            print(f"  ✓ {r['qid']}: {r.get('path') or '(dry-run)'}")
        else:
            print(f"  ✗ {r['qid']}: {r.get('why')}")

    n_ok = sum(1 for r in results if r.get("ok"))
    print(f"\nDONE: {n_ok}/{len(results)} draft(s) written successfully")
    return 0 if n_ok > 0 or args.dry_run else 1


if __name__ == "__main__":
    raise SystemExit(main())
interviews/vault-cli/scripts/validate_drafts.py (new executable file, 492 lines)
@@ -0,0 +1,492 @@
#!/usr/bin/env python3
"""Validate Gemini-authored draft questions (Phase 3.b).

For each ``*.yaml.draft`` under interviews/vault/questions/, run a
multi-gate scorecard:

1. schema       — Pydantic Question model (same gate as published)
2. originality  — cosine vs nearest neighbour in the same (track, topic);
                  reject if any neighbour exceeds the threshold (default 0.92)
3. level_fit    — Gemini-judge: "does this question's cognitive load match
                  level=<L>?", calibrated against ≤5 existing L-level
                  questions in the same topic.
4. coherence    — Gemini-judge: "are scenario / question /
                  realistic_solution mutually consistent?"
5. bridge       — Gemini-judge: "does this question pedagogically chain
                  between <between[0]> and <between[1]> from the gap?"

A draft passes when **all** gates return "yes" (or skipped). Output:

- per-draft scorecard rows in interviews/vault/draft-validation-scorecard.json
- stdout summary: pass/fail counts + per-gate failure reasons

Use case: pilot run lands ~30 drafts in the tree; this script tells the
human reviewer which to look at first (passes) vs which to discard
(failed bridge / failed coherence).

The originality gate needs an embedding model. By default it loads
BAAI/bge-small-en-v1.5 (the same model used for the corpus's
embeddings.npz) so cosine values are directly comparable. Pass
``--no-originality`` to skip if the model load is undesirable.

The LLM-judge gates need ``gemini`` on PATH (gemini-3.1-pro-preview).
Pass ``--no-llm-judge`` to skip those gates and only run schema +
originality.
"""

from __future__ import annotations

import argparse
import json
import re
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import yaml

REPO_ROOT = Path(__file__).resolve().parents[3]
VAULT_DIR = REPO_ROOT / "interviews" / "vault"
QUESTIONS_DIR = VAULT_DIR / "questions"
EMBEDDINGS_PATH = VAULT_DIR / "embeddings.npz"
DEFAULT_OUTPUT = VAULT_DIR / "draft-validation-scorecard.json"

GEMINI_MODEL = "gemini-3.1-pro-preview"
ORIGINALITY_THRESHOLD = 0.92  # cosine; >= this is "too duplicative"
LEVEL_FIT_EXEMPLAR_LIMIT = 5

try:
    from vault_cli.models import Question
except ImportError:
    Question = None  # type: ignore


# ─── corpus / drafts ──────────────────────────────────────────────────────


def load_yaml(path: Path) -> dict | None:
    try:
        with path.open(encoding="utf-8") as f:
            d = yaml.safe_load(f)
    except Exception:
        return None
    return d if isinstance(d, dict) else None


def load_corpus_index() -> dict[str, dict]:
    out: dict[str, dict] = {}
    for path in QUESTIONS_DIR.rglob("*.yaml"):
        d = load_yaml(path)
        if d and d.get("id"):
            out[d["id"]] = d
    return out


def find_drafts(scope: Path | None = None) -> list[Path]:
    root = scope or QUESTIONS_DIR
    return sorted(root.rglob("*.yaml.draft"))


def question_payload(q: dict[str, Any]) -> dict[str, Any]:
    d = q.get("details") or {}
    return {
        "id": q.get("id"),
        "level": q.get("level"),
        "title": q.get("title"),
        "scenario": q.get("scenario"),
        "question": q.get("question"),
        "realistic_solution": d.get("realistic_solution"),
    }

# ─── Gate 1: schema ───────────────────────────────────────────────────────


def gate_schema(draft: dict[str, Any]) -> tuple[bool, str]:
    if Question is None:
        return False, "vault_cli not importable; pip install -e interviews/vault-cli/"
    body = {k: v for k, v in draft.items() if not k.startswith("_")}
    if isinstance(body.get("details"), dict):
        body["details"] = {k: v for k, v in body["details"].items() if v is not None}
    try:
        Question.model_validate(body)
        return True, ""
    except Exception as e:
        return False, str(e)[:300]


# ─── Gate 2: originality (cosine vs neighbours) ───────────────────────────


_embed_state: dict[str, Any] = {}


def _load_embedding_model_and_corpus():
    """Lazy: load BAAI/bge-small-en-v1.5 + corpus vectors once per run."""
    if "model" in _embed_state:
        return _embed_state
    import numpy as np
    from sentence_transformers import SentenceTransformer

    if not EMBEDDINGS_PATH.exists():
        raise FileNotFoundError(f"missing {EMBEDDINGS_PATH} — needed for originality gate")
    npz = np.load(EMBEDDINGS_PATH, allow_pickle=True)
    model_name = str(npz["model_name"])
    model = SentenceTransformer(model_name)
    _embed_state.update({
        "model": model,
        "model_name": model_name,
        "vectors": npz["vectors"],  # (N, dim) L2-normalised
        "qids": [str(x) for x in npz["qids"]],
        "qid_to_row": {str(q): i for i, q in enumerate(npz["qids"])},
    })
    return _embed_state


def gate_originality(
    draft: dict[str, Any],
    corpus: dict[str, dict],
    threshold: float = ORIGINALITY_THRESHOLD,
) -> tuple[bool, str, dict[str, Any]]:
    """Return (ok, reason, detail).

    detail carries the top-1 neighbour qid + cosine, useful for the human
    reviewer to spot-check against.
    """
    import numpy as np
    state = _load_embedding_model_and_corpus()
    model = state["model"]
    vectors = state["vectors"]
    qids = state["qids"]
    qid_to_row = state["qid_to_row"]

    # Embed the draft (concat title + scenario + question — what the v1
    # corpus embedding script also used for its rows).
    text = "\n".join([
        draft.get("title", "") or "",
        draft.get("scenario", "") or "",
        draft.get("question", "") or "",
    ])
    vec = model.encode([text], normalize_embeddings=True)[0]

    # Restrict comparisons to the same (track, topic) bucket — that's
    # where duplicates would actually matter.
    track = draft.get("track")
    topic = draft.get("topic")
    bucket_qids = [
        qid for qid, q in corpus.items()
        if q.get("track") == track and q.get("topic") == topic
        and qid in qid_to_row
    ]
    if not bucket_qids:
        return True, "", {"note": "no in-bucket corpus neighbours; skipping"}

    rows = np.array([qid_to_row[q] for q in bucket_qids], dtype=np.int64)
    # cosine = dot product since both sides are L2-normalised
    sims = vectors[rows] @ vec  # (len(rows),)
    top = int(np.argmax(sims))
    top_qid = bucket_qids[top]
    top_cos = float(sims[top])

    detail = {"top_neighbour": top_qid, "cosine": round(top_cos, 4),
              "threshold": threshold, "bucket_size": len(bucket_qids)}
    if top_cos >= threshold:
        return False, f"too similar to {top_qid} (cosine={top_cos:.3f} >= {threshold})", detail
    return True, "", detail

# ─── Gate 3-5: Gemini judges ──────────────────────────────────────────────


def call_gemini_judge(prompt: str, timeout: int = 240) -> dict | None:
    """Single judge call; expects strict-JSON {"verdict": "yes|no", "rationale": "..."}."""
    try:
        result = subprocess.run(
            ["gemini", "-m", GEMINI_MODEL, "-p", prompt, "--yolo"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    out = (result.stdout or "").strip()
    if out.startswith("```"):
        out = out.strip("`")
        if out.startswith("json"):
            out = out[4:].lstrip()
    i = out.find("{")
    j = out.rfind("}")
    if i == -1 or j == -1:
        return None
    try:
        return json.loads(out[i:j+1])
    except json.JSONDecodeError:
        return None


def _judge_block(draft: dict[str, Any]) -> str:
    return json.dumps(question_payload(draft), indent=2)


def gate_level_fit(draft: dict, corpus: dict[str, dict]) -> tuple[bool, str, dict]:
    target_level = draft.get("level")
    track = draft.get("track")
    topic = draft.get("topic")
    exemplars = sorted(
        [q for q in corpus.values()
         if q.get("track") == track and q.get("topic") == topic
         and q.get("level") == target_level
         and q.get("status") == "published"],
        key=lambda q: q.get("id", ""),
    )[:LEVEL_FIT_EXEMPLAR_LIMIT]

    if not exemplars:
        return True, "", {"note": f"no published L={target_level} exemplars in bucket; skipping"}

    prompt = f"""You are calibrating cognitive load. Given an EXEMPLAR SET of
existing published interview questions at level={target_level} for
track={track}, topic={topic}, judge whether the CANDIDATE question
matches that level's typical cognitive demand.

Bloom mapping: L1=remember, L2=understand, L3=apply, L4=analyze,
L5=evaluate, L6+=create.

EXEMPLARS at level={target_level}:
{json.dumps([question_payload(q) for q in exemplars], indent=2)}

CANDIDATE:
{_judge_block(draft)}

Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
    resp = call_gemini_judge(prompt)
    if resp is None:
        return False, "no judge response", {}
    verdict = (resp.get("verdict") or "").strip().lower()
    if verdict == "yes":
        return True, "", {"rationale": resp.get("rationale", "")}
    return False, f"level_fit=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}


def gate_coherence(draft: dict) -> tuple[bool, str, dict]:
    prompt = f"""Judge whether the scenario, question, and realistic_solution
are MUTUALLY CONSISTENT. Specifically:
- Does the question logically follow from the scenario?
- Does the realistic_solution actually answer the question (not adjacent)?
- Are the numbers / system parameters internally consistent across all
  three fields (no contradictions)?

CANDIDATE:
{_judge_block(draft)}

Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
    resp = call_gemini_judge(prompt)
    if resp is None:
        return False, "no judge response", {}
    verdict = (resp.get("verdict") or "").strip().lower()
    if verdict == "yes":
        return True, "", {"rationale": resp.get("rationale", "")}
    return False, f"coherence=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}


def gate_bridge(draft: dict, corpus: dict[str, dict]) -> tuple[bool, str, dict]:
    auth = draft.get("_authoring") or {}
    gap = auth.get("gap") or {}
    between_ids = gap.get("between") or []
    between = [corpus.get(q) for q in between_ids if corpus.get(q)]
    if len(between) < 2:
        # Without two between-questions we can't judge a bridge meaningfully.
        return True, "", {"note": "fewer than 2 between-questions in corpus; skipping"}

    prompt = f"""Judge whether the CANDIDATE question pedagogically chains
between the two BETWEEN-questions. Specifically:
- Is the candidate's cognitive load above between[0]'s level and at or
  below between[1]'s level (Bloom progression direction)?
- Does the candidate share scenario/concept thread with the between-
  questions (not introducing a new system)?
- Would inserting the candidate between the two existing questions
  produce a coherent +1 (or +2 last-resort) progression chain?

BETWEEN[0] (lower):
{json.dumps(question_payload(between[0]), indent=2)}

BETWEEN[1] (higher):
{json.dumps(question_payload(between[1]), indent=2)}

CANDIDATE:
{_judge_block(draft)}

Return STRICT JSON with no prose or fences:
{{"verdict": "yes" | "no", "rationale": "<one sentence>"}}
"""
    resp = call_gemini_judge(prompt)
    if resp is None:
        return False, "no judge response", {}
    verdict = (resp.get("verdict") or "").strip().lower()
    if verdict == "yes":
        return True, "", {"rationale": resp.get("rationale", "")}
    return False, f"bridge=no: {resp.get('rationale', '')}", {"rationale": resp.get("rationale")}

# ─── runner ───────────────────────────────────────────────────────────────


def evaluate_draft(
    draft_path: Path,
    corpus: dict[str, dict],
    args: argparse.Namespace,
) -> dict[str, Any]:
    draft = load_yaml(draft_path)
    if not draft:
        return {"path": str(draft_path), "verdict": "fail",
                "errors": ["could not load YAML"]}

    try:
        rel_path = str(draft_path.relative_to(REPO_ROOT))
    except ValueError:
        rel_path = str(draft_path)
    rec: dict[str, Any] = {
        "path": rel_path,
        "draft_id": draft.get("id"),
        "track": draft.get("track"),
        "topic": draft.get("topic"),
        "level": draft.get("level"),
    }

    # Gate 1 — schema (mandatory)
    ok, why = gate_schema(draft)
    rec["schema_ok"] = ok
    if not ok:
        rec["schema_error"] = why
        rec["verdict"] = "fail"
        return rec  # downstream gates assume a structurally valid YAML

    # Gate 2 — originality
    if args.no_originality:
        rec["originality"] = "skipped"
    else:
        try:
            ok, why, detail = gate_originality(draft, corpus, threshold=args.threshold)
            rec["originality"] = "pass" if ok else "fail"
            rec["originality_detail"] = detail
            if not ok:
                rec["originality_reason"] = why
        except Exception as e:
            rec["originality"] = "error"
            rec["originality_reason"] = str(e)[:200]

    # Gates 3-5 — Gemini judges
    if args.no_llm_judge:
        rec["level_fit"] = "skipped"
        rec["coherence"] = "skipped"
        rec["bridge"] = "skipped"
    else:
        for name, gate in [("level_fit", gate_level_fit),
                           ("coherence", gate_coherence),
                           ("bridge", gate_bridge)]:
            try:
                if name == "coherence":
                    ok, why, detail = gate(draft)
                else:
                    ok, why, detail = gate(draft, corpus)
            except Exception as e:
                rec[name] = "error"
                rec[f"{name}_reason"] = str(e)[:200]
                continue
            rec[name] = "pass" if ok else "fail"
            rec[f"{name}_detail"] = detail
            if not ok:
                rec[f"{name}_reason"] = why
            time.sleep(args.judge_delay)  # be polite between calls

    # Final verdict: pass iff every non-skipped gate is pass.
    gate_results = [
        rec.get("originality"),
        rec.get("level_fit"),
        rec.get("coherence"),
        rec.get("bridge"),
    ]
    has_fail = any(r == "fail" for r in gate_results)
    has_error = any(r == "error" for r in gate_results)
    rec["verdict"] = "fail" if has_fail else ("error" if has_error else "pass")
    return rec


def main() -> int:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--scope", type=Path, default=None,
                    help=f"directory tree to scan for *.yaml.draft "
                         f"(default {QUESTIONS_DIR})")
    ap.add_argument("--output", type=Path, default=DEFAULT_OUTPUT,
                    help=f"scorecard JSON (default {DEFAULT_OUTPUT})")
    ap.add_argument("--no-originality", action="store_true",
                    help="skip the embedding-based originality gate")
    ap.add_argument("--no-llm-judge", action="store_true",
                    help="skip the Gemini-judge gates (level_fit, coherence, bridge)")
    ap.add_argument("--threshold", type=float, default=ORIGINALITY_THRESHOLD,
                    help=f"originality cosine cutoff (default {ORIGINALITY_THRESHOLD})")
    ap.add_argument("--judge-delay", type=float, default=4.0,
                    help="seconds between Gemini judge calls (default 4.0)")
    ap.add_argument("--limit", type=int, default=None,
                    help="evaluate only the first N drafts")
    args = ap.parse_args()

    drafts = find_drafts(args.scope)
    if args.limit:
        drafts = drafts[: args.limit]
    if not drafts:
        print(f"no *.yaml.draft files found under {args.scope or QUESTIONS_DIR}")
        return 0

    corpus = load_corpus_index()
    print(f"corpus: {len(corpus)} published+draft questions; "
          f"drafts to evaluate: {len(drafts)}")

    rows: list[dict[str, Any]] = []
    for i, p in enumerate(drafts, start=1):
        try:
            display = p.relative_to(REPO_ROOT)
        except ValueError:
            display = p
        print(f"\n[{i}/{len(drafts)}] {display}")
        rec = evaluate_draft(p, corpus, args)
        gate_summary = ", ".join(
            f"{g}={rec.get(g, '-')}"
            for g in ("originality", "level_fit", "coherence", "bridge")
        )
        print(f"  verdict={rec.get('verdict'):4s}  {gate_summary}")
        if rec.get("verdict") == "fail":
            for k in ("schema_error", "originality_reason",
                      "level_fit_reason", "coherence_reason", "bridge_reason"):
                if k in rec:
                    print(f"    {k}: {str(rec[k])[:200]}")
        rows.append(rec)

    try:
        out_display = args.output.relative_to(REPO_ROOT)
    except ValueError:
        out_display = args.output
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps({
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "originality_threshold": args.threshold,
        "drafts_evaluated": len(rows),
        "passes": sum(1 for r in rows if r.get("verdict") == "pass"),
        "fails": sum(1 for r in rows if r.get("verdict") == "fail"),
        "errors": sum(1 for r in rows if r.get("verdict") == "error"),
        "rows": rows,
    }, indent=2) + "\n")
    print(f"\nwrote {out_display}")
    n_pass = sum(1 for r in rows if r.get("verdict") == "pass")
    n_fail = sum(1 for r in rows if r.get("verdict") == "fail")
    n_err = sum(1 for r in rows if r.get("verdict") == "error")
    print(f"summary: pass={n_pass} fail={n_fail} error={n_err}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())