Sync the yaml-audit branch with the latest dev work since the previous
sync (5c5af75ed). Brings in 73 commits including:
- CI security fixes: postcss XSS bump, uuid bounds bump, codeql
  paths-ignore for vendored bundles, read-only token on
  staffml-validate-vault workflow
- kits/ dark mode polish: code-block readability, dropdown contrast
- vault-cli/: pre-commit ruff hook + 20 ruff fixes, all-contributors
  auto-credit workflow change to pull_request_target
- dev's earlier merge of yaml-audit (836d481b5) carrying the
  pre-trailer-strip Phase 1/2/3 history; this merge harmonises that
  with the current trailer-clean yaml-audit tip
- misc bug fixes (tinytorch perceptron seed, infra workflows,
  socratiq vite dev injector)
Conflict resolutions keep the yaml-audit side as authoritative for
vault/* files (we own those) and the dev side as authoritative for
.github/workflows/* and other shared infrastructure.
# Conflicts:
# .github/workflows/all-contributors-auto-credit.yml
# .github/workflows/staffml-preview-dev.yml
# interviews/staffml/src/data/corpus-summary.json
# interviews/staffml/src/data/vault-manifest.json
# interviews/staffml/tests/chain-and-vault-smoke.mjs
# interviews/vault-cli/README.md
# interviews/vault-cli/docs/CHAIN_ROADMAP.md
# interviews/vault-cli/scripts/build_chains_with_gemini.py
# interviews/vault-cli/scripts/generate_question_for_gap.py
# interviews/vault-cli/scripts/merge_chain_passes.py
# interviews/vault-cli/scripts/validate_drafts.py
# interviews/vault-cli/src/vault_cli/legacy_export.py
# interviews/vault-cli/tests/test_chain_validation.py
# interviews/vault/.gitignore
# interviews/vault/ARCHITECTURE.md
# interviews/vault/chains.json
# interviews/vault/id-registry.yaml
# interviews/vault/questions/edge/optimization/edge-2536.yaml
# interviews/vault/questions/mobile/deployment/mobile-2147.yaml
# tinytorch/src/03_layers/03_layers.py
The 2026-05-02 audit found that ~70% of detected chain gaps are
hallucinated: the two anchor questions don't share a scenario
thread, so a "bridge" between them is fictional. Without this gate,
generating from the existing 407-gap backlog would waste ~70% of the
budget (1 generation call + 3 downstream-judge calls per bad gap).
Adds a 1-call pre-filter via call_gemini_prefilter. The judge sees the
gap entry plus the two anchors in full and returns:
    {
      "verdict": "real" | "hallucinated",
      "anchors_share_scenario": "yes" | "no",
      "level_makes_sense": "yes" | "no",
      "rationale": "<one sentence>"
    }
Hallucinated → process_gap returns ok=False with the prefilter
verdict captured for review. Real → falls through to generation
(unchanged downstream behaviour).
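
The gating described above can be sketched roughly as follows. This is an
illustration only: PrefilterResult, parse_prefilter_response and gate_gap
are hypothetical names, not the functions in generate_question_for_gap.py,
and putting the rationale inside the parentheses of the "why" string is an
assumption.

```python
import json
from dataclasses import dataclass

ALLOWED_VERDICTS = {"real", "hallucinated"}


@dataclass
class PrefilterResult:
    verdict: str
    anchors_share_scenario: str
    level_makes_sense: str
    rationale: str


def parse_prefilter_response(raw: str) -> PrefilterResult:
    """Parse the judge's JSON reply and reject an off-schema verdict."""
    data = json.loads(raw)
    result = PrefilterResult(
        verdict=data["verdict"],
        anchors_share_scenario=data["anchors_share_scenario"],
        level_makes_sense=data["level_makes_sense"],
        rationale=data["rationale"],
    )
    if result.verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {result.verdict!r}")
    return result


def gate_gap(raw_response: str) -> dict:
    """Return ok=False for a hallucinated gap, ok=True for a real one."""
    result = parse_prefilter_response(raw_response)
    if result.verdict == "hallucinated":
        return {
            "ok": False,
            # ASSUMPTION: the parenthesised detail is the judge's rationale.
            "why": f"pre-filter: hallucinated gap ({result.rationale})",
            "prefilter": result,
        }
    # A real gap carries on to generation; the verdict is kept for review.
    return {"ok": True, "prefilter": result}
```

Keeping the parsed verdict on both branches means a reviewer can audit why
a gap was accepted as well as why one was dropped.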
Cost analysis at 70% hallucination rate, 30-gap batch:
  Before: 30 generations + 90 judge calls = 120 calls;
          ~7 promotable drafts
  After:  30 prefilter + ~9 generations + ~27 judge calls = ~66 calls;
          ~7 promotable drafts (same yield, roughly half the cost)
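
The call counts above can be reproduced with a few lines of arithmetic.
The function name and its parameters are illustrative; the only inputs
taken from this message are the 30-gap batch, the 70% hallucination rate,
and the 3 judge calls per generated draft.

```python
def call_counts(gaps: int, hallucination_rate: float, judge_calls_per_draft: int = 3):
    """Return (calls_before, calls_after, real_gaps) for one batch."""
    real_gaps = round(gaps * (1 - hallucination_rate))
    # Before: every gap gets one generation call plus the judge calls.
    before = gaps + gaps * judge_calls_per_draft
    # After: every gap gets one pre-filter call; only real gaps go further.
    after = gaps + real_gaps + real_gaps * judge_calls_per_draft
    return before, after, real_gaps


before, after, real_gaps = call_counts(30, 0.70)
print(before, after, real_gaps)  # 120 66 9
```

The saving grows with the hallucination rate: at 0% the pre-filter is pure
overhead (one extra call per gap), which is why --skip-prefilter exists for
already-filtered lists.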
Skip the pre-filter with --skip-prefilter when re-validating an
already-filtered gap list or for cost-debugging. Default is filter ON.
Smoke checks (mock prefilter responses):
- "real" → process_gap returns ok=True, falls through to generation
- "hallucinated" → ok=False, why="pre-filter: hallucinated gap (...)"
- --skip-prefilter → no pre-filter call, dry_run shows the prompt
Establishes one ignored subdirectory for ALL intermediate outputs of
LLM-driven tooling (chain proposals, gap detection, draft scorecards,
audit traces). Single gitignore rule: /_pipeline/.
Convention is documented in interviews/vault/README.md under "Pipeline
artifacts" — it's a real project layout convention, not AI-specific
config.
Path migration:
  interviews/vault/chains.proposed*.json
    → _pipeline/chains.proposed*.json
  interviews/vault/gaps.proposed*.json
    → _pipeline/gaps.proposed*.json
  interviews/vault/draft-validation-scorecard.json
    → _pipeline/draft-validation-scorecard.json
  interviews/vault/audit-runs/
    → _pipeline/runs/
8 scripts updated to define a PIPELINE_DIR constant and route default
outputs through it: build_chains_with_gemini.py,
apply_proposed_chains.py, merge_chain_passes.py, validate_drafts.py,
audit_chains_with_gemini.py, generate_question_for_gap.py,
summarize_proposed_chains.py, promote_drafts.py.
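
A minimal sketch of what routing default outputs through a PIPELINE_DIR
constant might look like. It assumes _pipeline/ sits under
interviews/vault/ and that run directories are named by a UTC timestamp;
neither detail is confirmed here, and default_output and run_dir are
hypothetical helpers.

```python
from datetime import datetime, timezone
from pathlib import Path

# ASSUMPTION: the ignored directory lives under the vault root. A real
# script would more likely derive this from __file__ than hard-code it.
VAULT_ROOT = Path("interviews/vault")
PIPELINE_DIR = VAULT_ROOT / "_pipeline"


def default_output(name: str) -> Path:
    """Default location for an intermediate artifact such as a scorecard."""
    return PIPELINE_DIR / name


def run_dir(now: datetime) -> Path:
    """Per-run audit directory, _pipeline/runs/<ts>/ (format is assumed)."""
    return PIPELINE_DIR / "runs" / now.strftime("%Y%m%dT%H%M%SZ")


print(default_output("draft-validation-scorecard.json"))
print(run_dir(datetime.now(timezone.utc)))
```

With every script deriving its defaults from one constant, the single
/_pipeline/ ignore rule covers all intermediate outputs.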
Forward-looking docs (README.md chain-pipeline section + CHAIN_ROADMAP.md
resume instructions + state snapshot) updated to reference the new
paths. Historical Progress Log entries left as-is — they accurately
describe what was committed at the time.
Drive-by .gitignore fixes (both used full repo-relative paths under
package-local .gitignore files, which never matched):
interviews/vault-cli/.gitignore: scripts/.calibration_cache/
interviews/vault/.gitignore: /embeddings.npz
Validation:
- vault check --strict: 10,705 loaded, 0 invariant failures
- pytest interviews/vault-cli/tests/: 74/74 passed
- audit --dry-run: paths resolve correctly to _pipeline/runs/<ts>/
No durable corpus content moves. chains.json (live registry),
id-registry.yaml, questions/, etc. all stay where they were.
Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.
generate_question_for_gap.py (3.a):
- Reads a gap entry, loads between-questions + same-bucket exemplars,
  prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
  and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
  drafts out of vault check / vault build until promotion.
- ID allocator scans corpus + existing drafts so a batch run gets
  distinct fresh IDs without touching id-registry.yaml.
- Modes: --gap-index, --gaps-from + --limit, --dry-run.
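
One way the ID allocation described above could work, as a hedged sketch:
scan both committed .yaml files and pending .yaml.draft files, then hand
out the next numbers for the track. The "highest existing number plus one"
policy, the <track>-<number> pattern and the allocate_ids name are
assumptions; the real allocator may differ.

```python
import re
from pathlib import Path

# ASSUMPTION: IDs look like "<track>-<number>", e.g. edge-2536.
ID_PATTERN = re.compile(r"^(?P<track>[a-z]+)-(?P<num>\d+)$")


def allocate_ids(questions_root: Path, track: str, count: int) -> list[str]:
    """Return `count` unused IDs for `track`, counting drafts as taken."""
    highest = 0
    for path in questions_root.rglob("*"):
        if not path.is_file():
            continue
        # Check the longer suffix first so drafts are not misread.
        for suffix in (".yaml.draft", ".yaml"):
            if path.name.endswith(suffix):
                stem = path.name[: -len(suffix)]
                break
        else:
            continue
        match = ID_PATTERN.match(stem)
        if match and match.group("track") == track:
            highest = max(highest, int(match.group("num")))
    # Nothing is written to id-registry.yaml; IDs are derived from files.
    return [f"{track}-{highest + i}" for i in range(1, count + 1)]
```

Because drafts are included in the scan, two consecutive batch runs do not
collide even though neither has been promoted yet.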
validate_drafts.py (3.b):
- Five gates per draft: schema (Pydantic), originality (cosine vs
  in-bucket neighbours via BAAI/bge-small-en-v1.5, the same model as
  the corpus embeddings.npz, so values are comparable; cutoff 0.92),
  level_fit (Gemini-judge against same-level exemplars), coherence
  (Gemini-judge: scenario/question/solution consistency), and bridge
  (Gemini-judge: chain-fit between the gap's two anchors).
- Final verdict is pass iff every non-skipped gate passes.
- Skips: --no-originality, --no-llm-judge.
- Output: interviews/vault/draft-validation-scorecard.json.
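
The "pass iff every non-skipped gate passes" rule can be sketched as
below. Representing a skipped gate as None, and the final_verdict name,
are assumptions made for illustration rather than the scorecard's actual
schema.

```python
from typing import Optional

GATES = ("schema", "originality", "level_fit", "coherence", "bridge")


def final_verdict(results: dict[str, Optional[bool]]) -> str:
    """Combine per-gate results; None (or a missing key) means skipped."""
    ran = [results[g] for g in GATES if results.get(g) is not None]
    # all([]) is True, so a draft with every gate skipped would pass.
    # In this sketch schema is assumed to always run, so that cannot occur.
    return "pass" if all(ran) else "fail"


scorecard = {
    "schema": True,
    "originality": None,  # e.g. run with --no-originality
    "level_fit": True,
    "coherence": True,
    "bridge": False,
}
print(final_verdict(scorecard))  # fail
```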
Smoke checks:
- 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
  cloud-4579. Synthetic Gemini response Pydantic-validates clean.
- 3.b on a synthetic /tmp draft: schema + originality pass (top
  neighbour cosine 0.73 vs 0.92 threshold).
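
The originality comparison in that smoke check can be illustrated with
plain Python. This sketch takes embedding vectors as already computed (in
the real tool they come from BAAI/bge-small-en-v1.5) and assumes a draft
fails when its top neighbour cosine reaches the 0.92 cutoff; whether the
real comparison is strict or inclusive is not stated.

```python
import math

ORIGINALITY_CUTOFF = 0.92


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def originality_gate(draft_vec, neighbour_vecs, cutoff=ORIGINALITY_CUTOFF):
    """Pass when the closest in-bucket neighbour stays below the cutoff."""
    top = max((cosine(draft_vec, v) for v in neighbour_vecs), default=0.0)
    # ASSUMPTION: similarity equal to the cutoff counts as a failure.
    return {"passed": top < cutoff, "top_cosine": top}


# Toy 2-D vectors stand in for real embeddings.
print(originality_gate([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])["passed"])
```

A top cosine of 0.73, as in the smoke check, sits well under the cutoff
and would pass.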
Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.
CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.