# StaffML Data Quality Guardrails

How to prevent taxonomy/corpus drift, duplicates, and normalization decay.

## The Problem
Issues recur because they enter through four ingestion points, each unguarded:
| Ingestion Point | Script | What Goes Wrong |
|---|---|---|
| Taxonomy extraction | `extract_taxonomy.py --merge` | LLM produces Title Case names → creates non-kebab IDs and name duplicates |
| Question generation | `generate.py`, `generate_gaps.py` | LLM uses free-text `competency_area` instead of the canonical enum; `primary_concept` doesn't match taxonomy |
| Gap filling | `vault_fill.py`, `fill_gaps.sh` | Adds questions tagged to concepts that don't exist in taxonomy |
| Manual edits | Direct `corpus.json` edits | `question_count` goes stale; `L6` used instead of `L6+` |
## The Solution: Three Layers

### Layer 1: Invariant Checks (catch at commit time)

```bash
python3 scripts/vault_invariants.py         # 14 checks, exit 1 on FAIL
python3 scripts/vault_invariants.py --fix   # Auto-fix what's fixable
python3 scripts/vault_invariants.py --json  # Machine-readable for CI
```
Run this after every pipeline operation. It catches:
- Duplicate concept names/IDs (Checks 1-2)
- Non-kebab-case IDs from LLM extraction (Check 3)
- Stale `question_count` (Check 4, auto-fixable)
- Corpus↔taxonomy concept drift (Checks 5 and 14)
- Orphan prerequisites (Check 6)
- Graph cycles (Check 7)
- Non-canonical competency areas (Check 8)
- Non-canonical levels like `L6` (Check 9, auto-fixable)
- Duplicate question IDs (Check 10)
- Broken chain references (Check 11)
- Duplicate titles (Check 12, warn)
- Disconnected singleton concepts (Check 13, warn)
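For concreteness, two of the simpler checks might look like the sketch below. This is illustrative only: the function names and the in-memory concept shape are assumptions, and the real `vault_invariants.py` is the source of truth.

```python
import re
from collections import Counter

# Assumed kebab-case rule: lowercase alphanumeric runs joined by single hyphens.
KEBAB_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def check_duplicate_ids(concepts):
    """Check 2 (sketch): every taxonomy concept ID must be unique."""
    counts = Counter(c["id"] for c in concepts)
    return [f"duplicate concept id: {cid}" for cid, n in counts.items() if n > 1]

def check_kebab_ids(concepts):
    """Check 3 (sketch): concept IDs must be kebab-case."""
    return [f"non-kebab id: {c['id']}" for c in concepts if not KEBAB_RE.match(c["id"])]

concepts = [
    {"id": "query-optimization"},
    {"id": "Query_Optimization"},   # non-kebab: uppercase and underscore
    {"id": "query-optimization"},   # duplicate of the first entry
]
print(check_duplicate_ids(concepts))  # → ['duplicate concept id: query-optimization']
print(check_kebab_ids(concepts))      # → ['non-kebab id: Query_Optimization']
```

Each check returns a list of failure messages; an empty list means PASS, which makes the exit-1-on-FAIL behavior a one-line aggregation over all checks.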
### Layer 2: Pipeline Integration (catch at generation time)
Add invariant checks to the existing workflow scripts so issues are caught immediately, not discovered later.
**In `extract_taxonomy.py --merge`:** After merging, run checks 1-3 and reject the merge if new duplicates or non-kebab IDs are introduced.

**In `generate.py` and `generate_gaps.py`:** After generating a batch, validate that all new questions have a canonical `competency_area` and that `primary_concept` exists in `taxonomy.json`. Reject non-conforming questions before they enter the corpus.

**In `vault_fill.py`:** After filling, run the full invariant suite. If new FAILs are introduced (compared to the pre-fill baseline), abort.

**In `scorecard.py`:** Add invariant check results to the scorecard output so drift is visible in every health report.
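The baseline-compare-abort pattern described for `vault_fill.py` can be factored into a small context manager that any pipeline script wraps around its work. A minimal sketch, assuming the invariant suite can be exposed as a callable returning the set of currently FAILing check names (the `InvariantGate` name and the check labels below are illustrative):

```python
class InvariantGate:
    """Snapshot FAILing checks on entry; abort on exit if new FAILs appeared.

    `get_fails` is any callable returning the set of currently FAILing check
    names -- e.g. a thin wrapper around `vault_invariants.py --json`.
    Pre-existing FAILs are tolerated; only regressions trip the gate.
    """
    def __init__(self, get_fails):
        self.get_fails = get_fails

    def __enter__(self):
        self.baseline = self.get_fails()   # pre-work snapshot
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:               # don't mask an in-flight exception
            new_fails = self.get_fails() - self.baseline
            if new_fails:
                raise RuntimeError(f"new invariant FAILs: {sorted(new_fails)}")
        return False

# Simulated run: the wrapped "fill" step introduces a regression.
state = {"fails": {"check-9"}}             # pre-existing FAIL is tolerated
try:
    with InvariantGate(lambda: set(state["fails"])):
        state["fails"].add("check-5")      # the wrapped work regresses
except RuntimeError as e:
    print(e)                               # → new invariant FAILs: ['check-5']
```

Injecting the check runner as a callable keeps the gate trivially testable and lets each script decide how the suite is invoked.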
### Layer 3: Extraction Hardening (prevent at source)
The root cause of most issues is the LLM extraction prompt producing free-form concept names that don't match existing taxonomy entries.
Fix the extraction prompt to:
- Include the current taxonomy concept list in the prompt context
- Instruct the LLM to reuse existing concept IDs when the concept already exists
- Force kebab-case output format for new concept IDs
- Validate extraction output against the existing taxonomy before merging
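The kebab-case forcing and reuse-existing-IDs steps can also be enforced in code rather than left entirely to the prompt. A minimal sketch (the helper names are hypothetical, and the normalization rule is an assumption about what counts as kebab-case here):

```python
import re

def to_kebab(name):
    """Normalize a free-form concept name to a kebab-case ID."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def resolve_concept(name, existing_ids):
    """Map an extracted concept name onto the taxonomy.

    Returns (concept_id, already_exists): reuse the existing ID when the
    normalized form is already in the taxonomy, otherwise propose a new one.
    """
    cid = to_kebab(name)
    return cid, cid in existing_ids

existing = {"query-optimization", "index-design"}
print(resolve_concept("Query Optimization", existing))   # → ('query-optimization', True)
print(resolve_concept("Write-Ahead Logging", existing))  # → ('write-ahead-logging', False)
```

Running this as a post-extraction pass means Title Case output from the LLM is normalized deterministically instead of leaking into the taxonomy as a near-duplicate.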
Fix the generation prompt to:
- Include `VALID_AREAS` from `schema.py` in the prompt
- Include valid concept IDs from `taxonomy.json` in the prompt
- Reject generated questions that use non-canonical values
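Batch-level rejection might look like the sketch below. The `VALID_AREAS` and taxonomy values shown are made-up placeholders; the real ones come from `schema.py` and `taxonomy.json`.

```python
# Placeholder canonical values -- in the real pipeline these are loaded
# from schema.py (VALID_AREAS) and taxonomy.json (concept IDs).
VALID_AREAS = {"distributed-systems", "data-modeling", "query-engines"}
TAXONOMY_IDS = {"query-optimization", "consensus", "sharding"}

def reject_nonconforming(questions):
    """Split a generated batch into (accepted, rejected-with-reason) lists."""
    accepted, rejected = [], []
    for q in questions:
        if q["competency_area"] not in VALID_AREAS:
            rejected.append((q["id"], f"non-canonical competency_area: {q['competency_area']}"))
        elif q["primary_concept"] not in TAXONOMY_IDS:
            rejected.append((q["id"], f"unknown primary_concept: {q['primary_concept']}"))
        else:
            accepted.append(q)
    return accepted, rejected

batch = [
    {"id": "q1", "competency_area": "query-engines", "primary_concept": "sharding"},
    {"id": "q2", "competency_area": "Query Engines", "primary_concept": "sharding"},
    {"id": "q3", "competency_area": "query-engines", "primary_concept": "b-trees"},
]
accepted, rejected = reject_nonconforming(batch)
print([q["id"] for q in accepted])  # → ['q1']
print(rejected)                     # rejection reasons for q2 and q3
```

Rejecting with a reason (rather than silently dropping) makes the rejections loggable, which is what lets `scorecard.py` surface drift.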
## Integration Checklist
After implementing the remediation plan:
- `vault_invariants.py` passes with 0 FAIL
- `extract_taxonomy.py --merge` runs invariant checks 1-3 after merge
- `generate.py` validates `competency_area` against `VALID_AREAS` before writing
- `scorecard.py` includes invariant check summary
- WORKFLOW.md updated to include invariant check step
- CI workflow runs `vault_invariants.py --json` on vault data changes
## Suggested CI Workflow
```yaml
# .github/workflows/vault-validate.yml
name: Vault Data Validation
on:
  push:
    paths:
      - 'interviews/vault/corpus.json'
      - 'interviews/vault/taxonomy.json'
      - 'interviews/vault/chains.json'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.13'
      - run: pip install pydantic
      - name: Schema validation
        run: |
          cd interviews/vault
          python3 -c "
          import json
          from schema import validate_corpus
          corpus = json.load(open('corpus.json'))
          valid, errors, warnings = validate_corpus(corpus)
          print(f'Valid: {len(valid)}, Errors: {len(errors)}, Warnings: {len(warnings)}')
          if errors:
              for e in errors[:20]:
                  print(f'  ERROR: {e}')
              exit(1)
          "
      - name: Invariant checks
        run: |
          cd interviews/vault
          python3 scripts/vault_invariants.py --json > invariants.json
          python3 scripts/vault_invariants.py
```
## The Golden Rule
Every script that writes to `corpus.json` or `taxonomy.json` must run `vault_invariants.py` before and after, and abort if new FAILs appear.