Captures the release-readiness state of the vault and the plan for
finishing the audit work the 250/day Gemini cap has constrained.
Corpus health survey (9,446 published questions, no Gemini cost):
- 100% schema-valid (Pydantic)
- 90.9% format-compliant (Pitfall/Rationale/Consequence + Assumptions/
Calculations/Conclusion markers)
- 9.1% fail format compliance (861 questions; mechanical fixes)
- 134 placeholder titles (all global/* "Global New NNNN")
- 407 with provenance: None (should be "imported")
- 95.3% canonical bold-marker napkin_math; 4.7% partial / bullet-only
Template gap noted: vault new scaffolds only scenario + solution stubs;
the Pitfall/Rationale/Consequence and Assumptions/Calculations/Conclusion
templates are encoded ONLY in the generation prompt and the
format-compliance regex. There's no human-readable AUTHORING.md.
The new session is asked to ship one.
The plan: stratified sample of 1,000 questions (33 per track × level
cell) with full Gemini gate suite (math + coherence + level_fit +
bridge) at ~2,900 calls across ~12 days at the 250/day cap. Full-corpus
audit (~27,400 calls / ~110 days) is infeasible; sampling captures any
failure mode at >5-10% rate.
Includes:
- Concrete numbers from the corpus survey (failure counts by category)
- Day-by-day execution plan with resume instructions
- Daily cost-ledger format
- Stopping rules
- Post-audit cleanup → paper.tech update path
- Mechanical (no-Gemini) cleanups the new session can do in parallel
with the daily audit cycle (provenance fix, format markers, AUTHORING.md)
CHAIN_ROADMAP.md Progress Log entry points the resumable cursor at
this plan.
Vault Chain Coverage Roadmap
Status: Phase 1 + 2 complete · Phase 3 pilot landed (4 drafts pending review) · Phase 4 housekeeping mostly shipped
Branch: yaml-audit (off dev)
Worktree: /Users/VJ/GitHub/MLSysBook-yaml-audit
Last updated: 2026-05-02 (Phase 3 batch landed; corpus-wide release audit plan in RELEASE_AUDIT_PLAN.md — fresh session executes)
This document is the canonical resumable plan for the vault chain rebuild + corpus growth work. Future Claude sessions: read the "Resume Here" section first, then the "Progress Log" to see what was done, then the relevant Phase section for the step you're picking up.
Resume Here (read first)
How to resume in a new session
- Confirm worktree + branch:

      pwd                        # /Users/VJ/GitHub/MLSysBook-yaml-audit
      git branch                 # * yaml-audit
      git log --oneline -5

- Run baseline validators to confirm the tree is in a known-good state:

      vault check --strict       # expect: 10,701 loaded, 0 invariant failures
      vault build --local-json   # expect: clean build, releaseId=dev, 9438 published

- Read the most recent entry in the Progress Log section below to see what step was last completed, what was decided, and what's next.
- Check for in-flight artifacts (all gitignored under _pipeline/; see interviews/vault/README.md §"Pipeline artifacts" for the convention):

      ls -la interviews/vault/_pipeline/chains.proposed*.json 2>/dev/null
      ls -la interviews/vault/_pipeline/gaps.proposed*.json 2>/dev/null
      ls -la interviews/vault/_pipeline/runs/ 2>/dev/null

- Pick up at the next step in the relevant Phase section, follow its detailed substeps, then append a new Progress Log entry when done.
Current state snapshot
- Sidecar architecture: active. chains.json is authoritative; YAML chains: field stripped from all 10,701 question YAMLs.
- Hierarchy: all questions at interviews/vault/questions/<track>/<area>/<id>.yaml.
- Live chain count: 824 (373 primary + 451 secondary; 55 Δ=0 secondary chains dropped 2026-05-02 per audit finding).
- Gap backlog: 138 (strict) + 269 (lenient) = 407 entries across _pipeline/gaps.proposed.json and _pipeline/gaps.proposed.lenient.json (gitignored; see interviews/vault/README.md §"Pipeline artifacts").
- Pre-Gemini chains backed up: interviews/vault/chains.json.bak (726 chains).
- Validators: all green as of last commit 1ac7d4c56.
- UI tests: interviews/staffml/tests/chain-and-vault-smoke.mjs, 13/13 pass.
- Remote: origin/yaml-audit pushed through 1ac7d4c56.
- Dev branch state: origin/dev has the hierarchy migration merged (commit 99b37f021) but NOT the Gemini chain rebuild or sidecar work. Re-merge to dev pending after Phase 1.
What we're building (motivation)
StaffML is an ML systems interview prep platform. Chains are pedagogical progressions through Bloom levels (L1→L2→L3→L4→L5→L6+) within a single topic, where each member builds on the previous. They power:
- Daily challenge sequencing (must show progression, not random)
- "Drill this topic" structured paths
- Mock interview question ordering
373 chains is too sparse: only 54 of 87 topics have any chain at all, and edge/mobile/tinyml tracks have <0.6 chains/topic vs cloud's 3.0. Target: 700-900 chains (~30-40% of corpus chained).
Phase 1 — Second-pass coverage build
Status: complete (2026-04-30)
Goal: double chain count from 373 → 700-800 by relaxing the prompt for
the ~150 buckets that produced 0 chains in the strict pass. Tag these as
tier: secondary so the UI can deprioritize them.
Phase 1 substeps
1.1 — Diagnose uncovered buckets
Deliverable: interviews/vault-cli/scripts/diagnose_chain_coverage.py
What it does:
- Loads YAML corpus (published only) and current chains.json
- For each (track, topic) bucket, reports:
  - Number of published questions
  - Current chain count
  - List of qids
- Output: interviews/vault/chain-coverage.json with two arrays:
  - uncovered_buckets: ≥3 questions, 0 chains
  - under_covered_buckets: ≥6 questions, only 1 chain
- Print summary to stdout: per-track totals, biggest gaps
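The classification rule above can be sketched in a few lines. This is a minimal illustration, not the actual diagnose_chain_coverage.py: the function name `classify_buckets` and the in-memory input shapes (question dicts with `id`, `track`, `topic`; chain dicts with `track`, `topic`) are assumptions made for the example.

```python
from collections import defaultdict

def classify_buckets(questions, chains):
    """Group published questions by (track, topic) and flag thin coverage.

    questions: iterable of dicts with "id", "track", "topic" (assumed shape).
    chains: iterable of dicts with "track", "topic" (assumed shape).
    """
    qids = defaultdict(list)
    for q in questions:
        qids[(q["track"], q["topic"])].append(q["id"])

    chain_counts = defaultdict(int)
    for c in chains:
        chain_counts[(c["track"], c["topic"])] += 1

    uncovered, under_covered = [], []
    for bucket, ids in sorted(qids.items()):
        n_chains = chain_counts[bucket]
        entry = {"track": bucket[0], "topic": bucket[1],
                 "question_count": len(ids), "chain_count": n_chains,
                 "qids": sorted(ids)}
        if len(ids) >= 3 and n_chains == 0:
            uncovered.append(entry)        # >=3 questions, 0 chains
        elif len(ids) >= 6 and n_chains == 1:
            under_covered.append(entry)    # >=6 questions, only 1 chain
    return {"uncovered_buckets": uncovered,
            "under_covered_buckets": under_covered}
```

The returned dict mirrors the two arrays named in the output description, so it could be written out with `json.dump` as-is.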
Validation: Expect ~150 uncovered buckets (sanity check on numbers). Cloud bucket count should be much lower than edge/mobile/tinyml.
Estimated time: 30 min
Files:
- create: interviews/vault-cli/scripts/diagnose_chain_coverage.py
- create: interviews/vault/chain-coverage.json (gitignored, regeneratable)
Commit message: feat(vault-cli): diagnose_chain_coverage.py — surface buckets needing chains
1.2 — Build relaxed Gemini prompt
Deliverable: --mode lenient flag added to build_chains_with_gemini.py
Prompt changes vs current strict (in PROMPT_TEMPLATE):
LEVEL PROGRESSION RULES (LENIENT MODE):
- Each consecutive pair of members satisfies: cand_level - prev_level ∈ {0, 1, 2, 3}
- Strongly prefer +1; +2 acceptable; +3 only when no intermediate exists
- Δ=0 (same-level pair) IS allowed when both questions clearly share
the same scenario thread but explore different angles of it (e.g.,
diagnosis vs design at the same Bloom level on the same setup).
Do NOT use Δ=0 for unrelated same-level questions.
- Never backward (Δ < 0)
DEFAULT MODE INSTRUCTION:
- Find at least one chain per bucket if any pedagogical clustering
exists. Only return zero chains when the questions are genuinely
unrelated even on the loosest reading.
Other constraints unchanged (single-topic, single-track, 2-6 members, multi-membership cap=2 for L1/L2 anchors).
Code change: Add --mode {strict,lenient} to argparse; lenient
swaps in the LENIENT prompt. Keep strict prompt as default.
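A minimal sketch of the flag described above, using the standard-library argparse module. The `build_parser` helper name and the help text are illustrative; only the `--mode {strict,lenient}` choice set and the strict default come from this step.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Build chains (sketch).")
    parser.add_argument(
        "--mode",
        choices=["strict", "lenient"],
        default="strict",  # strict stays the default, per the step above
        help="Prompt/validation mode for the sweep.",
    )
    return parser

# Parse an explicit argv list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--mode", "lenient"])
print(args.mode)
```

Because `choices` is set, argparse rejects any other mode value at parse time, which keeps the prompt selection code free of its own validation.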
Validation: Run on one previously-uncovered bucket as smoke test before the full sweep.
Estimated time: 30 min
Files:
- modify: interviews/vault-cli/scripts/build_chains_with_gemini.py
Commit message: feat(chains): --mode lenient for second-pass coverage on uncovered buckets
1.3 — Update validators to accept lenient chains + tier metadata
Deliverable: validate_chain in build script + apply_proposed_chains.py
both accept Δ ∈ {0, 1, 2, 3} (Δ=0 only when caller is in lenient mode).
Every chain produced by the lenient sweep is tagged tier: "secondary".
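One way to express the mode-dependent Δ check is sketched below. It is an illustration, not the real validate_chain: the names `ALLOWED_DELTAS`, `level_num`, and `check_progression` are hypothetical, and the strict set {1, 2} is taken from the Progress Log entry for 1.2 + 1.3, which records strict mode rejecting both Δ=0 and Δ=3.

```python
# Allowed level deltas per mode (strict set assumed from the Progress Log).
ALLOWED_DELTAS = {
    "strict": {1, 2},
    "lenient": {0, 1, 2, 3},
}

def level_num(level):
    """'L3' -> 3 and 'L6+' -> 6. Assumes levels look like 'L<digits>' or 'L<digits>+'."""
    return int(level.strip("L+"))

def check_progression(levels, mode="strict"):
    """Return one violation string per consecutive pair whose delta is not allowed."""
    allowed = ALLOWED_DELTAS[mode]
    problems = []
    for prev, cand in zip(levels, levels[1:]):
        delta = level_num(cand) - level_num(prev)
        if delta not in allowed:
            problems.append(f"{prev}->{cand}: delta {delta} not allowed in {mode} mode")
    return problems
```

Backward steps (Δ < 0) fall out of the same membership test, since no negative value appears in either set.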
Schema additions:
- Each chain object in chains.proposed.lenient.json gets "tier": "secondary"
- apply_proposed_chains.py validation tolerates the new tier field
Validation: New unit test in interviews/vault-cli/tests/test_chain_validation.py
covering Δ=0 acceptance under lenient mode, rejection under strict.
Estimated time: 30 min
Files:
- modify: interviews/vault-cli/scripts/build_chains_with_gemini.py
- modify: interviews/vault-cli/scripts/apply_proposed_chains.py
- create: interviews/vault-cli/tests/test_chain_validation.py (or add to existing)
Commit message: feat(chains): tier field + lenient-mode Δ=0 acceptance for same-scenario pairs
1.4 — Run second sweep on uncovered buckets only
Deliverable: interviews/vault/chains.proposed.lenient.json with chains
for the ~150 uncovered buckets.
Command:
python3 interviews/vault-cli/scripts/build_chains_with_gemini.py \
--mode lenient \
--buckets-from interviews/vault/chain-coverage.json \
--output interviews/vault/chains.proposed.lenient.json
(May need to add --buckets-from flag to take an explicit bucket list.)
Expected output:
- ~30-40 calls (within 250/day budget)
- ~200-400 chains added
- ~30 min wall time
Process management:
- Launch with nohup so shell exit doesn't kill it
- Monitor via chains.proposed.lenient.json size growth (incremental persistence)
- Check progress:

      python3 -c "import json; print(len(json.load(open('interviews/vault/chains.proposed.lenient.json'))))"
Validation:
- Per-track distribution should now be more balanced
- Δ=0 chains should be a meaningful fraction (~10-20%)
- No structural violations on apply
Estimated time: 30 min wall + 10 min review
Files:
- create: interviews/vault/chains.proposed.lenient.json
- update: interviews/vault/gaps.proposed.json (lenient pass also emits gaps)
Commit message: (no commit yet — staging file, see step 1.5)
1.5 — Merge primary + secondary into chains.json
Deliverable: interviews/vault-cli/scripts/merge_chain_passes.py
Logic:
- Load live chains.json → all entries get tier: "primary" if not set
- Load chains.proposed.lenient.json → entries already have tier: "secondary"
- Reject any secondary chain that contains a qid already in 2 primary chains (multi-chain cap is 2; can't exceed)
- Reject any secondary chain that contains a qid already in 1 primary chain when that qid is not L1/L2 (cap rule)
- Concatenate accepted chains, sort by chain_id, write chains.json
- Report stats: primary kept, secondary added, secondary rejected (and why)
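The merge logic above could look roughly like this. It is a sketch under stated assumptions rather than the shipped merge_chain_passes.py: chains are assumed to be dicts with `chain_id` and `qids`, `level_of` is a hypothetical qid-to-level lookup, and overlap between two secondary chains is deliberately not checked because the rules above only mention primary membership.

```python
from collections import Counter

def merge_chain_passes(primary, secondary, level_of):
    """Merge two chain lists, rejecting secondaries that break the membership cap.

    primary / secondary: lists of dicts with "chain_id" and "qids" (assumed shape).
    level_of: dict mapping qid -> level string such as "L2" (assumed input).
    """
    in_primary = Counter(q for c in primary for q in c["qids"])
    merged = [{**c, "tier": c.get("tier", "primary")} for c in primary]
    rejected = []
    for chain in secondary:
        reason = None
        for qid in chain["qids"]:
            count = in_primary[qid]
            if count >= 2:
                reason = f"{qid} already in 2 primary chains"
            elif count == 1 and level_of.get(qid) not in ("L1", "L2"):
                reason = f"{qid} is not L1/L2 and already in 1 primary chain"
            if reason:
                break
        if reason:
            rejected.append((chain["chain_id"], reason))
        else:
            merged.append({**chain, "tier": "secondary"})
    merged.sort(key=lambda c: c["chain_id"])
    return merged, rejected
```

Returning the rejection reasons alongside the merged list makes the "secondary rejected (and why)" stat a simple printout.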
Validation: Run apply_proposed_chains.py --proposed chains.json --dry-run
against the merged file as final structural gate.
Estimated time: 1 hour
Files:
- create: interviews/vault-cli/scripts/merge_chain_passes.py
- modify: interviews/vault/chains.json
Commit message: (no commit yet — see 1.6)
1.6 — Rebuild + verify + commit
Steps:
- vault check --strict → expect 0 invariant failures
- vault build --local-json → clean, chainCount jumps from 373 to ~700
- Restart staffml dev server, run playwright suite → expect 13/13 (or add tier-related test if scope justifies it)
- Commit Phase 1 work as a single conceptual commit:
Commit message:
feat(vault): Phase 1 — second-pass chain coverage build (~373 → ~700)
Diagnoses uncovered (track, topic) buckets and runs a relaxed Gemini
sweep targeting them. New chains are tier="secondary"; pre-existing
chains stay tier="primary".
Tools:
- diagnose_chain_coverage.py: surface buckets without chains
- build_chains_with_gemini.py: --mode lenient adds Δ=0 + Δ=3 acceptance
- merge_chain_passes.py: merges primary + secondary with cap enforcement
Coverage gains:
- Total chains: 373 -> N (TBD after run)
- Per-track: edge/mobile/tinyml lifted from <0.6 to >1.0 chains/topic
- Topics with ≥1 chain: 54/87 -> M/87 (TBD)
Validation: vault check --strict 0 failures, vault build clean,
playwright UI suite 13/13 pass.
git push origin yaml-audit
Estimated time: 30 min
Phase 1 risks & mitigations
| Risk | Mitigation |
|---|---|
| Gemini at lenient mode hallucinates poor chains | tier=secondary so UI deprioritizes; merge step rejects cap violations |
| Cross-pass duplicate qids | merge_chain_passes explicitly rejects |
| Δ=0 rule too permissive (random same-level pairs accepted) | Prompt restricts to "shared scenario thread"; spot-check 10 random Δ=0 chains post-sweep |
| Daily Gemini budget tight | Phase 1 needs ~30-40 calls; well under 250 |
Phase 2 — Tier surfacing (schema + UI)
Status: complete (2026-05-01)
Goal: chains carry their tier as authoritative metadata; UI prefers
primary chains in default surfaces, exposes secondary in "more paths."
Phase 2 substeps
2.1 — Schema migration
Deliverable: tier field on chain entries is required (default "primary").
Changes:
- chains.json: every chain has tier: "primary" | "secondary"
- interviews/vault-cli/src/vault_cli/validator.py: tier required, default to "primary" if missing
- interviews/vault-cli/src/vault_cli/legacy_export.py: include tier in corpus.json output as chain_tiers: {chain_id: "primary"|"secondary"} per question (mirrors chain_positions shape)
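The per-question chain_tiers shape can be derived from the chain list in one pass. The sketch below is illustrative only: `build_chain_tier_index` is a hypothetical name, and it assumes each chains.json entry carries `chain_id`, `qids`, and an optional `tier`.

```python
def build_chain_tier_index(chains):
    """Map qid -> {chain_id: tier}, defaulting a missing tier to "primary"."""
    index = {}
    for chain in chains:
        tier = chain.get("tier", "primary")
        for qid in chain["qids"]:
            index.setdefault(qid, {})[chain["chain_id"]] = tier
    return index

# Example: one question that sits in both a primary and a secondary chain.
example = build_chain_tier_index([
    {"chain_id": "edge-chain-1", "qids": ["edge-0220", "edge-0224"]},
    {"chain_id": "edge-chain-2-secondary", "qids": ["edge-0220"], "tier": "secondary"},
])
print(example["edge-0220"])
```

Defaulting inside the index builder means downstream consumers never see a missing tier, which matches the "default to primary if missing" rule.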
Files:
- modify: interviews/vault/chains.json (backfill primary tag)
- modify: interviews/vault-cli/src/vault_cli/validator.py
- modify: interviews/vault-cli/src/vault_cli/legacy_export.py
Estimated time: 30 min
Commit message: feat(vault): tier field on chains, derived chain_tiers in corpus.json
2.2 — TypeScript types
Files:
- modify: interviews/staffml/src/lib/corpus.ts
  - Question interface: add chain_tiers?: Record<string, "primary" | "secondary">
  - ChainInfo interface: add tier: "primary" | "secondary"
  - getChainForQuestion populates tier
  - getAllChainsForQuestion populates tier per chain
  - New: getPrimaryChainForQuestion(qid), which returns first primary and falls back to first secondary
Estimated time: 15 min
Commit message: (included in 2.3 commit)
2.3 — UI: prefer primary in default surfaces
Files to modify:
- interviews/staffml/src/components/ChainStrip.tsx: default to primary; subtle "Alternative path" hint when only secondary
- interviews/staffml/src/app/practice/page.tsx: "next in chain" prefers primary
- interviews/staffml/src/app/explore/page.tsx: filter dropdown "Primary only / All"
- Any daily-challenge / mock-interview routing: sequence primary chains only
- URL param: ?chain=<id> already supported via Phase A; verify both tiers reachable
Estimated time: 3-4 hours
Commit message: feat(staffml): UI tier-awareness — primary chains default, secondary opt-in
2.4 — Tests
Files:
- modify: interviews/staffml/tests/chain-and-vault-smoke.mjs
  - Add: "primary chain rendered by default when question has both"
  - Add: "secondary chain reachable via ?chain= URL param"
  - Run full suite → expect 15/15 pass
Estimated time: 1 hour
Commit message: test(staffml): tier-aware playwright cases (primary default, secondary via URL)
2.5 — Push
git push origin yaml-audit
Phase 3 — Gap-driven question authoring
Status: pilot run shipped (3.c + 3.d); 3.e gated on human review of drafts
Goal: Use the 138+ entries in gaps.proposed.json to author new
questions filling missing rungs, validated independently before commit.
This is the durable corpus growth strategy.
Phase 3 substeps
3a — Authoring tool design (1 day)
Deliverable: interviews/vault-cli/scripts/generate_question_for_gap.py
Inputs:
- A single gap entry from gaps.proposed.json:

      {
        "track": "edge",
        "topic": "memory-mapped-inference",
        "missing_level": "L3",
        "between": ["edge-0220", "edge-0224"],
        "rationale": "Bridge demand-paging concept to fault-tolerance application"
      }
What it does:
- Loads the between questions in full (scenario + question + solution)
- Loads 2-3 exemplar questions from same (track, topic) at the target level
- Loads the question Pydantic schema as a textual summary
- Prompts Gemini 3.1 Pro Preview with all the above + instruction to author a question matching the schema, fitting the bridge requirement
- Validates output against Pydantic schema
- Writes to interviews/vault/questions/<track>/<area>/<auto-id>.yaml.draft (.draft suffix prevents vault check from loading it as published)
- Records authoring metadata: gap source, model, timestamp
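The last two steps (draft path and authoring metadata) are easy to get subtly wrong, so a small sketch may help. Everything here is hypothetical scaffolding: the helper names, the metadata field names, and the `area` argument (the gap entry shown above carries no area, so the caller is assumed to supply it). The Gemini call itself is intentionally left out.

```python
from datetime import datetime, timezone
from pathlib import Path

def draft_path(root, track, area, number):
    """Build <root>/<track>/<area>/<track>-<NNNN>.yaml.draft."""
    return Path(root) / track / area / f"{track}-{number:04d}.yaml.draft"

def authoring_metadata(gap, model):
    """Record where a draft came from (field names are illustrative)."""
    return {
        "gap_source": {
            "track": gap["track"],
            "topic": gap["topic"],
            "missing_level": gap["missing_level"],
            "between": list(gap["between"]),
        },
        "model": model,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

gap = {
    "track": "edge",
    "topic": "memory-mapped-inference",
    "missing_level": "L3",
    "between": ["edge-0220", "edge-0224"],
}
path = draft_path("interviews/vault/questions", "edge", "some-area", 2545)
meta = authoring_metadata(gap, "gemini-3.1-pro-preview")
```

Because the final suffix is `.draft` rather than `.yaml`, a loader that globs for `*.yaml` would skip the file, which is the property the step above relies on.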
Validation: Schema check passes before persisting. Filename matches
<track>-<NNNN>.yaml.draft convention.
Estimated time: 1 day
Files:
- create: interviews/vault-cli/scripts/generate_question_for_gap.py
Commit message: feat(vault-cli): generate_question_for_gap.py — Gemini-author candidate questions from gaps file
3b — Validation framework (1 day)
Deliverable: interviews/vault-cli/scripts/validate_drafts.py
Checks per draft:
- Schema validation (Pydantic): same gates as published questions
- Originality: embed the draft + nearest-neighbor cosine in same bucket; reject if cosine > 0.92 (too duplicative of existing)
- Level fit: LLM-judge call asking "is this question's cognitive load consistent with level=L<N>?" Sample 5 existing L<N> questions in the same topic for calibration.
- Scenario coherence: Gemini check that scenario, question, and realistic_solution are internally consistent
- Bridge check: the new question genuinely chains between the gap's between questions (LLM-judge with both between questions in context)
Output: Scorecard JSON per draft:
{
"draft_id": "edge-2545",
"schema_ok": true,
"originality_cosine": 0.81,
"level_fit": "yes",
"scenario_coherence": "yes",
"bridge_check": "yes",
"verdict": "pass",
"rationale": "..."
}
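A sketch of how the originality gate and the final verdict might be combined into the scorecard above. The 0.92 threshold comes from the checks list; the rule that every gate must pass for a "pass" verdict, the "fail" label, and all function names are assumptions. Embeddings are plain lists of floats here so the example stays dependency-free.

```python
import math

ORIGINALITY_MAX_COSINE = 0.92  # threshold stated in the checks above

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbor_cosine(draft_vec, bucket_vecs):
    """Highest similarity between the draft and any existing question in the bucket."""
    return max((cosine(draft_vec, v) for v in bucket_vecs), default=0.0)

def build_scorecard(draft_id, schema_ok, originality_cosine, judge_answers):
    """judge_answers maps level_fit / scenario_coherence / bridge_check to 'yes' or 'no'."""
    gates = ("level_fit", "scenario_coherence", "bridge_check")
    passed = (
        schema_ok
        and originality_cosine <= ORIGINALITY_MAX_COSINE
        and all(judge_answers[g] == "yes" for g in gates)
    )
    return {
        "draft_id": draft_id,
        "schema_ok": schema_ok,
        "originality_cosine": round(originality_cosine, 2),
        **judge_answers,
        "verdict": "pass" if passed else "fail",
    }
```

Keeping the LLM-judge answers as an input keeps the verdict logic testable without any model calls.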
Estimated time: 1 day
Files:
- create: interviews/vault-cli/scripts/validate_drafts.py
Commit message: feat(vault-cli): validate_drafts.py — schema + originality + level + coherence + bridge checks
3c — Pilot run on highest-value gaps (1-2 hours)
Steps:
- From gaps.proposed.json, prioritize:
  - Gaps where the bucket has 4+ questions already (just missing the bridge)
  - Gaps in tracks with low chain density (tinyml, mobile)
- Pick top 30 gaps
- Run generate_question_for_gap.py on each → 30 draft files
- Run validate_drafts.py → expect ~60-75% pass rate
- Manual review of passing drafts (~30 min)
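The prioritization in the first step can be expressed as a sort key. This is one possible reading, not a specified algorithm: the relative ordering of the two criteria, the bucket-size tiebreak, and the `bucket_sizes` input are all assumptions.

```python
LOW_DENSITY_TRACKS = {"tinyml", "mobile"}  # tracks named in the step above

def prioritize_gaps(gaps, bucket_sizes, top_n=30):
    """Rank gaps: buckets with 4+ questions first, then low-density tracks.

    bucket_sizes: dict mapping (track, topic) -> published question count
    (assumed input, e.g. derived from chain-coverage.json).
    """
    def score(gap):
        size = bucket_sizes.get((gap["track"], gap["topic"]), 0)
        return (size >= 4, gap["track"] in LOW_DENSITY_TRACKS, size)
    return sorted(gaps, key=score, reverse=True)[:top_n]
```

Tuple comparison gives a lexicographic ranking, so a gap in a well-populated bucket always outranks one in a sparse bucket regardless of track.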
Files:
- create: interviews/vault/questions/<track>/<area>/<auto-id>.yaml.draft × 30
- create: interviews/vault/draft-validation-scorecard.json
Commit message: (no commit until 3d)
3d — Promote drafts (1 hour)
For each accepted draft:
- Rename .yaml.draft → .yaml
- Add authoring: { origin: "gemini-3.1-pro-preview", reviewed_by: "<user>", date: "<>" }
- Set status: published (or draft for further iteration)
- Commit each individually for granular review:
  feat(vault): add <id> filling chain gap (track=<>, topic=<>, level=<>)
3e — Re-run chain build to absorb (~30 min)
python3 interviews/vault-cli/scripts/build_chains_with_gemini.py --all \
--output interviews/vault/chains.proposed.json
python3 interviews/vault-cli/scripts/apply_proposed_chains.py
vault check --strict
vault build --local-json
Expected: chain count grows by ~50% of newly authored questions (since they were authored TO fit chains).
3f — Iterate weekly until gap count < threshold
Repeat 3c-3e weekly, tracking metrics:
- Gaps closed per week
- Chains added per week
- Total corpus size
Phase 4 — Other pending items (parallel/ongoing)
These can slot between major phases. Order roughly:
4.1 — Chain audit CI gate
Status: complete 2026-05-01 — 476e9b146. Two CI steps in
staffml-validate-vault.yml: structural audit via
apply_proposed_chains.py --dry-run on chains.json, plus a
per-track count regression guard (cloud ≥ 100 / edge ≥ 25 /
mobile ≥ 25 / tinyml ≥ 25 / global ≥ 0). The roadmap originally
suggested a vault chains audit subcommand; the existing validator
covers it without new CLI surface.
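The per-track count regression guard amounts to comparing chain counts against fixed floors. A minimal sketch follows, assuming each chain entry exposes a `track` key; the function name is hypothetical and the CI wiring is omitted.

```python
from collections import Counter

# Floors quoted in the status note above.
TRACK_FLOORS = {"cloud": 100, "edge": 25, "mobile": 25, "tinyml": 25, "global": 0}

def count_regressions(chains, floors=TRACK_FLOORS):
    """Return {track: (count, floor)} for every track that fell below its floor."""
    counts = Counter(c["track"] for c in chains)
    return {t: (counts[t], floor) for t, floor in floors.items() if counts[t] < floor}
```

A CI step would fail the job when the returned dict is non-empty and print the offending tracks.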
4.2 — Multi-chain UI verification
Status: audited 2026-05-01 — 0 qids in >1 chain (lenient sweep
was scoped to uncovered buckets, so no overlap with primary). Becomes
live once Phase 3 authoring lands; deferred until then.
4.3 — Authoring UX integration
When: after Phase 3
Deliverable: vault new post-write hook calls vault chains suggest to
propose chain memberships for the new question.
4.4 — Deploy pipeline lockstep
When: anytime (independent)
Files: .github/workflows/staffml-publish-live.yml
Change: wait for cloudflare worker release_id match before site deploy.
4.5 — Cross-encoder reranking experiment
When: low priority
Action: re-run interviews/vault-cli/scripts/cross_encoder_rerank_experiment.py
on a beefier machine; OOM'd on 16GB.
4.6 — Periodic chain rebuild automation
Status: complete 2026-05-01 — 03ea7da6b. New workflow at
.github/workflows/staffml-chain-rebuild.yml. Currently
workflow_dispatch-only (cron commented out); flip the cron on once
the rebuild has demonstrated stable diffs against small corpus deltas.
PR body includes per-track delta + proposed/live counts. Apply step
intentionally manual.
4.7 — Chain decay detection
Status: complete 2026-05-01 — 09c04224f. New script
scripts/check_chain_decay.py. Default invocation diffs against
origin/dev, re-embeds changed YAMLs with the same model the corpus
uses, and reports min mate-cosine per chain membership. Threshold 0.40
default; advisory (exit 0) by default; --strict makes it a CI gate.
Intentionally not auto-wired into pre-commit on first ship — model
load is heavy (~135MB first-run download), opt-in is right.
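The "min mate-cosine" idea can be illustrated without the embedding model. The sketch below assumes that a question's mates are the other members of the same chain and that embeddings are already available as a qid-to-vector dict; both are interpretations, and none of the names come from check_chain_decay.py.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def min_mate_cosine(qid, chain_qids, embeddings):
    """Lowest similarity between qid and any other member of the chain."""
    mates = [m for m in chain_qids if m != qid]
    return min((cosine(embeddings[qid], embeddings[m]) for m in mates), default=1.0)

def decay_report(changed_qids, chains, embeddings, threshold=0.40):
    """Return (chain_id, qid, min_cosine) for changed memberships under the threshold."""
    flagged = []
    for chain in chains:
        for qid in chain["qids"]:
            if qid in changed_qids:
                score = min_mate_cosine(qid, chain["qids"], embeddings)
                if score < threshold:
                    flagged.append((chain["chain_id"], qid, score))
    return flagged

def exit_code(flagged, strict=False):
    """Advisory by default (0); strict mode fails when anything is flagged."""
    return 1 if (strict and flagged) else 0
```

Separating `exit_code` from the report keeps the advisory and CI-gate behaviours on the same code path.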
4.8 — Update docs
Status: complete 2026-05-01 — f086b6f42. ARCHITECTURE.md §3.6
captures sidecar + hierarchy + tier; README.md gains a "Chain build
pipeline" section + updated layout/status.
4.9 — gitignore CI guard
Status: complete 2026-05-01 — dbd3d9458. New step in
staffml-validate-vault.yml that hard-fails if any YAML under
interviews/vault/questions/ is gitignored. Implementation: git ls-files --others --ignored --exclude-standard on the questions
tree; non-empty output → fail.
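For local use outside CI, the same guard can be reproduced with a short script. This is a sketch, not the workflow step itself: the function names are hypothetical, and `find_ignored_questions` must run from inside the repository. The git flags are the ones quoted above.

```python
import subprocess

def ignored_paths(ls_files_output):
    """Split `git ls-files` output into non-empty path lines."""
    return [line for line in ls_files_output.splitlines() if line.strip()]

def find_ignored_questions(tree="interviews/vault/questions/"):
    """List gitignored, untracked files under the questions tree."""
    result = subprocess.run(
        ["git", "ls-files", "--others", "--ignored", "--exclude-standard", "--", tree],
        capture_output=True, text=True, check=True,
    )
    return ignored_paths(result.stdout)

def main():
    """Return 1 (fail) when any path is reported, 0 otherwise."""
    offenders = find_ignored_questions()
    for path in offenders:
        print(f"gitignored file under questions tree: {path}")
    return 1 if offenders else 0
```

Calling `raise SystemExit(main())` from a script entry point would give the same hard-fail behaviour as the CI step.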
4.10 — Merge yaml-audit → dev (re-merge with chain rebuild)
When: after Phase 1
Action: git merge --no-ff yaml-audit from dev worktree, push.
Triggers staffml-validate-vault.yml on the full corpus including new
sidecar architecture.
Recommended execution order
Week 1
Day 1-2: Phase 1 (1.1 → 1.6)
Day 3: 4.1 (CI gate) + 4.10 (merge to dev)
Day 4-5: Phase 2 (2.1 → 2.5)
Week 2
Day 1: 4.2 (multi-chain verify) + 4.7 (chain decay)
Day 2-5: Phase 3a + 3b (authoring tool + validator)
Week 3
Day 1-2: Phase 3c (pilot batch)
Day 3: Phase 3d-3e (promote + re-chain)
Day 4: 4.6 (periodic rebuild) + 4.4 (deploy lockstep)
Day 5: 4.8 (docs) + 4.9 (gitignore CI)
Week 4+
Phase 3f iterations on remaining gaps
4.3 (authoring UX), 4.5 (cross-encoder)
Progress Log
Append-only. New entries at the bottom. Each step that ships should add a dated entry with: what was done, validation results, commits, notes for the next session.
2026-04-30 — Roadmap document created
What was done:
- Created this roadmap document (interviews/vault-cli/docs/CHAIN_ROADMAP.md)
- Captures full plan: Phase 1-4, recommended execution order, resume instructions
- Initialized Progress Log for append-only step notes
State at this point:
- Branch yaml-audit at 1ac7d4c56 (Gemini chain rebuild applied)
- 373 chains in chains.json (sidecar-authoritative, no chains: field in YAMLs)
- 138 gaps in gaps.proposed.json (authoring backlog)
- 10,701 YAMLs at hierarchical paths (<track>/<area>/<id>.yaml)
- vault check --strict: 0 failures
- vault build: clean, chainCount=373
- playwright UI suite: 13/13
- chains.json.bak is the pre-Gemini 726-chain backup
Files committed in this session (chronological):
- aa9373f88 paths.py + scripts hierarchy-tolerant
- 2a48177ac migrate 10,701 YAMLs to hierarchical layout
- f7d7a328a path-vs-body invariants in vault check
- d476b63df prune stale chains.json entries
- 367cda468 rescue 924 yamls from gitignore via /data/ anchor
- efeedb8cc sidecar architecture (chains.json authoritative, strip YAML chains:)
- 8423dcb08 build_chains_with_gemini.py + apply_proposed_chains.py
- 0b14e08b5 summarize_proposed_chains.py
- d8a55f333 strict progression rules + multi-chain cap
- 681e40463 gap detection + multi-chain UI helpers
- d82a4f00a Gemini CLI exit-1 tolerance + inter-call backoff
- 1ac7d4c56 apply 373 chains from Gemini rebuild
Next step: Phase 1.1 — write diagnose_chain_coverage.py to identify
the ~150 buckets that need a second pass.
2026-04-30 — Phase 1.1: diagnose_chain_coverage.py
What was done:
- Added interviews/vault-cli/scripts/diagnose_chain_coverage.py
- Loads published corpus via vault_cli.policy.is_published (single source of truth; same predicate as vault build) and current chains.json.
- Buckets by (track, topic); emits per-bucket question_count, chain_count, qids, chain_ids; classifies into uncovered_buckets (≥3 q, 0 chains) and under_covered_buckets (≥6 q, ≤1 chain). Prints per-track summary + top-10 uncovered for quick read.
- Output: interviews/vault/chain-coverage.json (gitignored, regeneratable). Added /chain-coverage.json to interviews/vault/.gitignore.
Validation results (run on tree at 1ac7d4c56 + this commit):
- 313 buckets, 9438 published questions (matches vault build ✓), 373 chains.
- Uncovered buckets: 211 (roadmap estimate was ~150). Higher than expected but same order of magnitude. The gap is mostly in global (32 uncovered with avg ~6.5 q/bucket) and a long tail of small mobile/tinyml topics.
- Sanity check passes: cloud chain density 2.95/topic vs edge 0.64, mobile 0.74, tinyml 0.80, global 0.00. This matches the direction of the "<0.6 vs 3.0" claim, although edge, mobile, and tinyml measure slightly above 0.6 per bucket here.
- Every chain in chains.json lands in a published bucket (0 orphan chain-buckets), which confirms chains.json is consistent with the released corpus.
- Notable: cloud:roofline-analysis (144 questions, 0 chains) is the single largest uncovered bucket. The first-pass Gemini sweep missed it entirely despite cloud's high overall coverage. Worth a targeted retry in Phase 1.4.
Per-track table:
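| track | buckets | questions | chains | chains/topic | uncov | undercov |
|---|---|---|---|---|---|---|
| cloud | 82 | 4028 | 242 | 2.95 | 41 | 0 |
| edge | 76 | 2077 | 49 | 0.64 | 58 | 0 |
| global | 48 | 313 | 0 | 0.00 | 32 | 0 |
| mobile | 62 | 1818 | 46 | 0.74 | 46 | 1 |
| tinyml | 45 | 1202 | 36 | 0.80 | 34 | 0 |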
Validators (re-run as a sanity gate on the unmodified corpus):
- vault check --strict → 10,701 loaded, 0 invariant failures ✓
- vault build --local-json → releaseId=dev, 9438 published, chainCount=373 ✓
Files committed:
- interviews/vault-cli/scripts/diagnose_chain_coverage.py (new)
- interviews/vault/.gitignore (add /chain-coverage.json)
- this Progress Log entry
Notes for next session:
- Higher-than-expected uncovered count (211 vs ~150) means Phase 1.4 Gemini call budget should be re-checked: at 1 call per ~5-7 buckets we may need ~30-45 calls (still under 250/day, but worth monitoring).
- The single under_covered bucket (mobile:transformer-systems-cost, 54 questions, 1 chain) is a candidate for a focused retry alongside the lenient sweep.
- chain-coverage.json is regenerated each run; do not check it in. The roadmap step 1.4 will read this file via --buckets-from.
Next step: Phase 1.2 — add --mode lenient to
build_chains_with_gemini.py (relaxed Δ rules + Δ=0 for shared-scenario
pairs).
2026-04-30 — Phase 1.2 + 1.3: --mode lenient + tier field
What was done:
- build_chains_with_gemini.py: split PROMPT_TEMPLATE into STRICT_PROMPT_TEMPLATE and LENIENT_PROMPT_TEMPLATE. The lenient template tells Gemini to accept Δ ∈ {0, 1, 2, 3} where Δ=0 is shared-scenario only and Δ=3 is last-resort. Single MODE_CONFIG dict maps mode → (template, allowed Δ set) so build_prompt and validate_chain stay in lockstep.
- validate_chain now takes mode=; rejects Δ=0 / Δ=3 in strict, accepts them in lenient. Both modes still reject backward Δ, multi-topic, and out-of-range size.
- process_batch tags lenient-mode chains with tier: "secondary" and uses chain_id suffix -secondary so primary/secondary IDs never collide.
- New CLI flags: --mode {strict,lenient} (default strict); --buckets-from <chain-coverage.json> to restrict the run to the uncovered_buckets list from diagnose_chain_coverage.py.
- apply_proposed_chains.py: docstring note acknowledging tier is intentionally unvalidated (UI hint, not a structural invariant). No logic change: its existing non-strict monotonicity check already accepts Δ=0.
- tests/test_chain_validation.py: 19 cases covering both modes. Loads the script via importlib (it's not in the importable vault_cli package). All 19 pass.
Smoke check:
- --dry-run --buckets-from chain-coverage.json --mode lenient → 17 calls planned, 211 buckets selected, well under the 250/day cap.
- Direct call to validate_chain on 12 hand-built cases (Δ=0/Δ=3 accept-in-lenient/reject-in-strict, backward, multi-topic, sizes 1 and 7) → 12/12 pass.
Commit: 6cef27ea2 feat(chains): --mode lenient + tier field for second-pass coverage
2026-04-30 — Phase 1.4 + 1.5 + 1.6: lenient sweep, merge, validate
What was done:
- Phase 1.4 (lenient sweep): Ran build_chains_with_gemini.py --mode lenient --buckets-from chain-coverage.json against the 211 uncovered buckets. 17 Gemini-3.1-pro-preview calls, ~22 min wall time. Output: chains.proposed.lenient.json with 506 chains (above the 200-400 estimate) and gaps.proposed.lenient.json with 269 new gaps. The validator caught a few cross-bucket and Δ=4 hallucinations in calls 15 + 16 and rejected them inline.
- Phase 1.5 (merge): Wrote interviews/vault-cli/scripts/merge_chain_passes.py. Backfills tier=primary on existing chains, layers in tier=secondary chains, and rejects any secondary that would violate the multi-membership cap (max 2 chains per qid; non-L1/L2 capped at 1). Smoke-tested on 5 synthetic cases; caps enforce as designed.
- Phase 1.5 (merge run): 0 cap rejections (expected: the lenient sweep was scoped to uncovered buckets, so secondary qids are by definition fresh). Final merged count: 373 + 506 = 879 chains.
- Phase 1.6 (validate):
  - apply_proposed_chains.py --proposed chains.json --dry-run → validation clean (879 chains).
  - vault check --strict → 10,701 loaded, 0 invariant failures.
  - vault build --local-json → releaseId=dev, published_count=9438, chainCount=879 (was 373); release_hash changes to 04ee8a23….
  - Started next dev server; ran node interviews/staffml/tests/chain-and-vault-smoke.mjs → 13/13 pass. Server stopped post-run.
Distribution & quality checks (lenient pass):
| Δ | count | % | (note) |
|---|---|---|---|
| 0 | 55 | 5.2 | shared-scenario pairs |
| 1 | 736 | 69.1 | preferred shape |
| 2 | 225 | 21.1 | one-rung skip |
| 3 | 49 | 4.6 | last-resort |
- Chains with at least one Δ=0: 55 / 506 (10.9%) — within the roadmap's expected 10-20% band.
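Numbers like these can be recomputed from the chain file with a short helper. The sketch assumes the table rows count consecutive member pairs (the rows sum to 1,065, more than the 506 chains), and that each chain's member levels are available as a list of integers; the function name and input shape are illustrative.

```python
from collections import Counter

def delta_stats(chains_levels):
    """chains_levels: per-chain level lists as ints, e.g. [[1, 2, 2], [3, 4]].

    Returns (pair delta counts, pair delta percentages, chains with any delta of 0).
    """
    pair_deltas = Counter()
    chains_with_zero = 0
    for levels in chains_levels:
        deltas = [b - a for a, b in zip(levels, levels[1:])]
        pair_deltas.update(deltas)
        if 0 in deltas:
            chains_with_zero += 1
    total_pairs = sum(pair_deltas.values())
    pct = {d: round(100 * n / total_pairs, 1) for d, n in sorted(pair_deltas.items())}
    return pair_deltas, pct, chains_with_zero

counts, pct, zero_chains = delta_stats([[1, 2, 2], [3, 4], [2, 5]])
```

Counting pairs and chains separately is what allows a pair-level share and a chain-level share to be reported side by side, as the table and the bullet above do.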
- Random spot-check of 5 Δ=0 chains: all show genuine shared-scenario threads (DMA optimization on nRF5340; SRAM OOM on Cortex-M4 residual blocks; CMSIS-NN performance variations; on-device vs cloud routing; PB-scale data pipelines). No "two unrelated L3s glued together" hits in the sample.
Coverage gains:
| track | primary | secondary | total | chains/topic before → after |
|---|---|---|---|---|
| cloud | 242 | 116 | 358 | 2.95 → 4.37 |
| edge | 49 | 148 | 197 | 0.64 → 2.59 |
| mobile | 46 | 113 | 159 | 0.74 → 2.56 |
| tinyml | 36 | 83 | 119 | 0.80 → 2.64 |
| global | 0 | 46 | 46 | 0.00 → 0.96 |
| all | 373 | 506 | 879 | 1.19 → 2.81 |
Buckets with ≥1 chain: 285 / 313 (91%) — was 102 / 313 (33%) before the lenient pass. The 28 remaining un-chained buckets are either tiny (≤2 questions) or the lenient sweep judged them genuinely unrelated.
Files committed in the Phase 1 commit:
- interviews/vault-cli/scripts/merge_chain_passes.py (new)
- interviews/vault/chains.json (373 → 879)
- interviews/vault/chains.proposed.lenient.json (durable record of the lenient pass)
- interviews/vault/gaps.proposed.lenient.json (269 new gaps for Phase 3)
- interviews/staffml/src/data/vault-manifest.json (chainCount + releaseHash)
- interviews/staffml/src/data/corpus-summary.json (chain memberships per question, derived by vault build)
- interviews/vault-cli/docs/CHAIN_ROADMAP.md (this Progress Log entry, plus status flips: Phase 1 complete, snapshot updated)
Notes for next session:
- Phase 1 done. Phase 2 (tier surfacing in UI) is now unblocked. Start at 2.1 (schema migration): every chain entry needs tier, validator should default missing-tier to "primary", legacy_export.py needs to emit chain_tiers per question (mirrors chain_positions).
- Phase 3 (gap-driven authoring) inherits a much bigger backlog now: 407 gaps total (138 strict + 269 lenient). Prioritize buckets where the bucket already has 4+ questions and just needs the bridge.
- Consider running 4.1 (CI gate) before Phase 2 so any tier-related regressions during 2.x get caught in CI. Roadmap says "before Phase 3 (gates corpus growth)"; could pull it forward.
- Pre-merge backup chains.json.pre-merge.bak was deleted; canonical pre-Gemini backup remains at chains.json.bak (the original 726-chain pre-rebuild snapshot).
Next step: Phase 2.1 — schema migration (tier required on chain
entries, chain_tiers derived in legacy_export.py).
2026-05-01 — Phase 2: tier surfacing schema → TS → UI
What was done:
Phase 2.1 — backend / schema:
- `legacy_export.py`: added `_build_chain_tier_index` (qid → {chain_id: tier}) parallel to the existing `_build_chain_index`. `_adapt` emits a new `chain_tiers` field on every legacy item that has `chain_ids`, defaulting any missing chain-tier to `"primary"`.
- `vault build` re-run: 2953 chained questions, 2953 carry `chain_tiers` (100% coverage). releaseHash unchanged from Phase 1 (`04ee8a23…`) since the new field doesn't perturb the manifest hash inputs.
- No validator changes: tier is a UI-routing hint, not a structural invariant. Missing tier defaults to "primary" everywhere.
- Test fixes: existing `test_chain_positions_plural_preserved` and `test_multi_chain_membership` were stale (they still asserted on the v1.0 YAML `chains:` field path; v1.1 made chains.json the sidecar source, so the tests were silently broken). Rewrote them to write a chains.json fixture into `tmp_path` and added `chain_tiers` assertions, plus a new `test_chain_tiers_emitted_per_membership` covering primary + secondary + missing-tier cases.
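To make the Phase 2.1 mapping concrete, here is a minimal sketch of a qid → {chain_id: tier} index with the missing-tier default. It is not the code in `legacy_export.py`: the function name without the leading underscore, the `id` / `tier` / `questions` keys, and the sample chain IDs are assumptions made for illustration, since the chains.json layout is not shown in this log.

```python
from collections import defaultdict

DEFAULT_TIER = "primary"  # a chain entry with no tier is treated as primary


def build_chain_tier_index(chains):
    """Map qid -> {chain_id: tier} for every chain a question belongs to.

    `chains` is assumed to be a list of dicts with hypothetical keys
    "id", "tier" (optional) and "questions" (ordered list of qids).
    """
    index = defaultdict(dict)
    for chain in chains:
        tier = chain.get("tier") or DEFAULT_TIER
        for qid in chain.get("questions", []):
            index[qid][chain["id"]] = tier
    return dict(index)


# Hypothetical sample data, not taken from the real chains.json.
chains = [
    {"id": "chain-a", "tier": "primary", "questions": ["cloud-0001", "cloud-0002"]},
    {"id": "chain-b", "tier": "secondary", "questions": ["cloud-0002"]},
    {"id": "chain-c", "questions": ["cloud-0003"]},  # no tier recorded
]
index = build_chain_tier_index(chains)
print(index["cloud-0002"])
print(index["cloud-0003"])
```

Defaulting at index-build time keeps every downstream consumer free of its own missing-tier handling, which matches the "defaults to primary everywhere" rule stated above.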
Phase 2.2 — TypeScript types:
- `staffml/src/lib/corpus.ts`: `Question.chain_tiers?` added (optional `Record<string, "primary" | "secondary">`). New `ChainTier` exported type. `ChainInfo` gains a required `tier` field.
- Internal `_chainTier: Map<chainId, ChainTier>` built alongside `_chainIndex` so the runtime can answer "what tier is this chain?" in O(1) without re-scanning questions.
- `getChainForQuestion` and `getAllChainsForQuestion` populate `tier` on returned ChainInfo objects. `getAllChainsForQuestion` now sorts primary chains first.
- New `getPrimaryChainForQuestion(qid)`: returns the first primary chain, falling back to the first secondary, falling back to null. This is the default-surface helper for UI components.
- `npx tsc --noEmit`: 0 errors after the change.
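The fallback order of `getPrimaryChainForQuestion` can be sketched as below. The real helper is TypeScript and lives in `corpus.ts`; this is a Python rendering of the described behaviour only, and the function name, the plain-dict input, and the chain IDs are illustrative assumptions.

```python
from typing import Dict, Optional


def primary_chain_for_question(chain_tiers: Dict[str, str]) -> Optional[str]:
    """Return the first primary chain id, else the first secondary, else None.

    `chain_tiers` is one question's {chain_id: tier} mapping; dict insertion
    order stands in for whatever ordering the real index uses.
    """
    for wanted in ("primary", "secondary"):
        for chain_id, tier in chain_tiers.items():
            if tier == wanted:
                return chain_id
    return None


print(primary_chain_for_question({"chain-b": "secondary", "chain-a": "primary"}))
print(primary_chain_for_question({"chain-b": "secondary"}))
print(primary_chain_for_question({}))
```

Returning a secondary chain when no primary exists means a secondary-only question still gets a chain surface by default.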
Phase 2.3 — UI:
- `practice/page.tsx`: reads the `?chain=<id>` URL param. Uses `getChainForQuestion(qid, chainParam)` when set, otherwise `getPrimaryChainForQuestion(qid)`. Existing pre-reveal ChainBadge + collapsible ChainStrip rendering paths preserved.
- `ChainBadge.tsx`: added optional `tier` prop. When `tier === "secondary"`, the badge renders an "alt path" pill inline (always visible, no click required to discover the tier). Default is `"primary"` so existing call sites don't need updating.
- `ChainStrip.tsx`: same "alt path" pill in the progress-dot row when the rendered chain is secondary, for users who do click in.
- `explore/page.tsx`: when a question is in multiple chains, the explorer prefers the first non-secondary chain when picking `activeChainId` for the related-questions panel.
- Deferred from the roadmap's Phase 2.3 scope (tracked for a follow-up):
  - "Primary only / All" filter dropdown on the explore page
  - Daily-challenge / mock-interview routing changes (those flows don't currently key on chain tier; punted to a focused later commit)
Phase 2.4 — playwright tests:
- Added `test7_tier_aware_chain_routing` to `chain-and-vault-smoke.mjs`. Covers four assertions:
  - Secondary chain reachable via `?chain=<id>` URL param
  - "alt path" badge visible on the secondary chain
  - Primary-chain question still loads (regression check)
  - "alt path" badge ABSENT on primary chain (negative check)
- Full suite: 17/17 pass (was 13/13). Roadmap target was 15/15; added one more sub-assertion than planned for the negative check.
- Test fixtures pinned to `cloud-0231` (secondary-only) + `cloud-chain-auto-secondary-013-04` and `cloud-0001` (primary).
Validators (re-confirmed end of Phase 2):
- `vault check --strict`: 10,701 loaded, 0 invariant failures
- `vault build --local-json`: 9438 published, chainCount=879
- `pytest interviews/vault-cli/tests/`: 74/74 pass
- `npx tsc --noEmit`: 0 errors
- `node interviews/staffml/tests/chain-and-vault-smoke.mjs`: 17/17
Notes for next session:
- Phase 2 done. Phase 3 (gap-driven authoring) is unblocked. Backlog for authoring is now 407 gaps (138 strict + 269 lenient).
- The deferred explore-page filter is not load-bearing: secondary chains are reachable via `?chain=` and don't pollute the default surfaces. Worth picking up before Phase 4.x scaffolding.
- 0 questions currently belong to BOTH a primary and secondary chain (because the lenient sweep was scoped to uncovered buckets). When Phase 3 authors new questions into already-chained buckets, the cap rules in `merge_chain_passes.py` will start mattering for real.
- Consider scheduling a one-time agent to merge `yaml-audit` → `dev` again now that Phase 2 is shipped (the local `dev` worktree has Phase 1 only; Phase 2 + the CHAIN_ROADMAP updates are not in dev).
Next step: Phase 3.a — generate_question_for_gap.py (Gemini
authoring tool that takes a gap entry and drafts a candidate question
fitting the bridge requirement).
2026-05-01 — Phase 4.8 docs + Phase 4.2 audit
Phase 4.8 — docs (shipped):
- `interviews/vault/ARCHITECTURE.md` gains a new §3.6 capturing the three v1.1 deltas: hierarchy, sidecar chain registry, tier model. Additive to v1, not replacements; cross-refs CHAIN_ROADMAP.md.
- `interviews/vault-cli/README.md`: status line bumped from "Phase 0 scaffolding" to v1.1; new "Chain build pipeline" section with invocation examples for diagnose / build / apply / merge; layout block reflects scripts/ + actual src/ contents.
- Commit: `f086b6f42 docs(vault): document v1.1 sidecar + hierarchy + tier model`
Phase 4.2 — multi-chain UI audit (no-op for now):
- Audited `chains.json`: 0 qids in >1 chain. Reason: the strict pass already enforces the multi-membership cap within-tier, and the lenient pass was scoped to uncovered buckets, so no qid in any primary chain was reachable for a secondary chain to bind to. The merge step's cap rules consequently never fired (0 rejections).
- Action: defer the focused playwright test. The case becomes exercisable when Phase 3 authoring fills bucket gaps and a re-run of `build_chains_with_gemini.py --all` (which will see those new questions in already-chained buckets) produces a multi-chain qid.
- No commit needed: zero state change.
Notes for next session:
- Phase 1, Phase 2, and Phase 4.8 are all shipped on `origin/yaml-audit`. Local `dev` worktree has Phase 1 only (Phase 2 + docs not re-merged). The user has been doing parallel workflow refactoring on dev, so I held off on a second yaml-audit → dev merge to avoid colliding with their `.github/workflows/` edits. When the user is ready, the merge can be done from a clean dev worktree state with `git merge --no-ff yaml-audit`.
- Phase 4.1 (CI gate), 4.4 (deploy lockstep), 4.6 (periodic rebuild automation), and 4.9 (gitignore CI guard) all touch `.github/workflows/`. The user has uncommitted changes there, so these were intentionally skipped this session.
Next step: Phase 3.a — generate_question_for_gap.py. This is the
first of the gap-driven authoring tools. The roadmap budgets it at "1
day" because it's the substantive new capability of Phase 3 (Gemini
authoring vs. just chain construction). Best done with the user
available to review the first few generated drafts.
2026-05-01 — Phase 3.a + 3.b: authoring + validation tooling
What was done:
Phase 3.a — generate_question_for_gap.py:
- Reads a gap entry (`{track, topic, missing_level, between, rationale}`) from gaps.proposed.json (or .lenient.json), loads the between-questions in full + up to 3 same-bucket exemplars at the target level, prompts Gemini-3.1-pro-preview with the schema summary + bridge context, and writes a candidate question to `interviews/vault/questions/<track>/<area>/<id>.yaml.draft`.
- ID allocator scans the existing corpus + already-written drafts so a batch run gets distinct fresh IDs without touching `id-registry.yaml` (registry append happens at promotion time, not generation).
- Authoring metadata stamped under a private `_authoring` block: origin model, tool name, timestamp, and the source gap entry. The Pydantic Question model has `extra="allow"`, so this passes schema.
- Modes: `--gap-index <N>` (single gap), `--gaps-from <path> --limit N` (batch), `--dry-run` (build prompts without calling Gemini).
- Smoke checks:
  - `--dry-run --gap-index 0` resolves the first gap, finds 3 exemplars, builds the prompt, allocates `cloud-4579`. ✓
  - Synthetic Gemini response → `assemble_draft` → `Question.model_validate` passes; YAML preview looks right (12-field body, sensible details). ✓
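One way an ID allocator like the one described could work is sketched below. The "highest existing number plus one" strategy, the function name, and the four-digit zero padding are assumptions; the log only states that the allocator scans the corpus and existing drafts and does not touch `id-registry.yaml`.

```python
import re

# Matches IDs of the form "<track>-<number>", e.g. "edge-2535".
ID_PATTERN = re.compile(r"^(?P<track>[a-z]+)-(?P<num>\d+)$")


def allocate_ids(track, existing_ids, count=1):
    """Return `count` fresh IDs for `track`, numbered above every ID seen.

    `existing_ids` should combine published-corpus IDs and IDs of drafts
    already written, so a batch run never hands out the same ID twice.
    """
    highest = 0
    for qid in existing_ids:
        match = ID_PATTERN.match(qid)
        if match and match.group("track") == track:
            highest = max(highest, int(match.group("num")))
    return [f"{track}-{highest + offset:04d}" for offset in range(1, count + 1)]


# Hypothetical inputs for illustration.
corpus_ids = {"cloud-0001", "cloud-4578", "edge-2534"}
draft_ids = {"edge-2535"}
print(allocate_ids("edge", corpus_ids | draft_ids, count=2))
```

Feeding the draft IDs back in alongside the corpus IDs is what lets consecutive batch runs stay collision-free before anything is appended to the registry.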
Phase 3.b — validate_drafts.py:
- Five-gate scorecard per draft:
  1. schema: Pydantic Question (mandatory; downstream gates skip on schema fail to avoid spurious LLM calls)
  2. originality: embeds `title + scenario + question` with `BAAI/bge-small-en-v1.5` (matches the corpus embeddings.npz model so cosines are directly comparable), compares against in-bucket neighbors, flags any `cosine ≥ 0.92`
  3. level_fit: Gemini-judge against ≤5 published exemplars at the target level in the same (track, topic)
  4. coherence: Gemini-judge that scenario / question / realistic_solution are mutually consistent
  5. bridge: Gemini-judge that the candidate genuinely chains between the two `between` questions named in `_authoring.gap`
- Skips: `--no-originality` (skip embed model load), `--no-llm-judge` (skip Gemini gates). Schema gate is unconditional.
- Output: `interviews/vault/draft-validation-scorecard.json` with per-row detail + final verdict (`pass | fail | error`).
- Smoke check: synthetic draft in /tmp passed schema + originality (top-neighbor cosine 0.73 vs 0.92 threshold). End-to-end runner produced a well-formed scorecard. ✓
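The originality gate reduces to a nearest-neighbor cosine check against the 0.92 threshold. The sketch below shows that comparison on toy vectors in pure Python; the function names and return shape are assumptions, and the real gate works on `BAAI/bge-small-en-v1.5` embeddings loaded from `embeddings.npz`, which this example does not attempt to reproduce.

```python
import math

ORIGINALITY_THRESHOLD = 0.92  # threshold quoted in the gate description


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def originality_gate(candidate, neighbors, threshold=ORIGINALITY_THRESHOLD):
    """Return (verdict, nearest_id, nearest_cosine) for a candidate embedding.

    `neighbors` maps in-bucket question IDs to their embeddings. The gate
    fails when the closest neighbor is at or above the threshold.
    """
    best_id, best_cos = None, -1.0
    for qid, vector in neighbors.items():
        similarity = cosine(candidate, vector)
        if similarity > best_cos:
            best_id, best_cos = qid, similarity
    verdict = "fail" if best_cos >= threshold else "pass"
    return verdict, best_id, best_cos


# Toy 2-D vectors; real embeddings have far more dimensions.
neighbors = {"q-near": [1.0, 0.1], "q-far": [0.0, 1.0]}
print(originality_gate([1.0, 0.0], neighbors))  # close paraphrase of q-near
print(originality_gate([1.0, 1.0], {"q-far": [0.0, 1.0]}))
```

Reporting the nearest neighbor's ID alongside the verdict is what makes a failure actionable for a reviewer, who can read the two questions side by side.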
What was deliberately not done tonight:
- Phase 3.c (pilot run on 30 highest-value gaps): This generates new YAML question content that needs human review before promotion. Running 30 unsupervised generations and 30×3 LLM-judge calls (level_fit + coherence + bridge) without the user available to spot-check the first few outputs is the wrong shape of work for an overnight slot. The tooling is ready when the user is.
- Phase 3.d–3.f: Promotion + re-chain are downstream of 3.c acceptance.
Recommended pilot when the user is back:
- Pick 30 gaps from `gaps.proposed.lenient.json` where the bucket has ≥4 questions already (just missing the bridge): `python3 interviews/vault-cli/scripts/generate_question_for_gap.py --gaps-from interviews/vault/gaps.proposed.lenient.json --limit 30`
- Validate: `python3 interviews/vault-cli/scripts/validate_drafts.py`
- Manually review the passing drafts (~20-25 expected).
- Promote: rename `.yaml.draft` → `.yaml`, append to id-registry.
- Re-run `build_chains_with_gemini.py --all` so the new questions get absorbed into chains.
Files committed:
- `interviews/vault-cli/scripts/generate_question_for_gap.py` (new)
- `interviews/vault-cli/scripts/validate_drafts.py` (new)
- `interviews/vault-cli/docs/CHAIN_ROADMAP.md` (this Progress Log entry + status flips)
Notes for next session:
- Both scripts assume the `gemini` CLI is on PATH (gemini-3.1-pro-preview) and, for originality, the corpus's `embeddings.npz` (gitignored, regenerable by the existing embedding scripts). `validate_drafts --no-llm-judge` is a fast first cut that only exercises schema + originality if you want to triage drafts before paying for the LLM-judge calls.
- Heads up: each draft in 3.b consumes ~3 Gemini calls (level_fit + coherence + bridge). 30 drafts → ~90 calls. Daily cap is 250.
- `id-registry.yaml` is append-only and CI-enforced. Promotion (3.d) needs to add new IDs to it; that's not yet wired into a script, so it is a manual append for the pilot, after which we can extract a `vault promote` helper from the pattern.
Next step: Phase 3.c — pilot run on 30 high-value gaps (best done with the user available to spot-check the first few outputs).
2026-05-01 — Phase 3.c + 3.d: pilot run + promotion (5 gaps)
Pilot scope (sized down from the roadmap's 30): 5 high-value gaps,
selected from gaps.proposed.lenient.json favoring (track, topic)
buckets with ≥4 published questions and biased toward low-density
tracks. All 5 picks landed in edge/mobile (the tracks whose chain
densities the lenient sweep most needed help on).
Phase 3.c — generate (generate_question_for_gap.py):
| target | gap | result |
|---|---|---|
| edge-2535 | edge/latency-decomposition L?→L3 between=[edge-1883, edge-1701] | written |
| edge-2536 | edge/pruning-sparsity L?→L4 between=[edge-1960, edge-1957] | written |
| edge-2537 | edge/tco-cost-modeling L?→L3 between=[edge-0731, edge-1154] | written |
| mobile-2146 | mobile/duty-cycling L?→L3 between=[mobile-0367, mobile-2034] | written |
| mobile-2147 | mobile/model-format-conversion L?→L2 between=[mobile-0984, mobile-1022] | written |
5/5 generated cleanly. Each draft passed Pydantic schema validation
inline (the assemble_draft → Question.model_validate gate); none
were rejected at the file-write step.
Spot-checking edge-2535: realistic ML-systems scenario (Coral USB
TPU + MobileNetV2-SSD + INT8), concrete numbers, calculation-driven
question consistent with L3/apply, solution gets at the actual
insight (host-side bottleneck). Other 4 are similarly competent.
Phase 3.b run — validate_drafts.py:
| draft | originality | level_fit | coherence | bridge | verdict |
|---|---|---|---|---|---|
| edge-2535 | fail (cos=0.933 vs edge-1883) | pass | pass | pass | fail |
| edge-2536 | pass | pass | pass | pass | pass |
| edge-2537 | pass | pass | pass | pass | pass |
| mobile-2146 | pass | pass | pass | pass | pass |
| mobile-2147 | pass | pass | pass | pass | pass |
4/5 pass = 80% pass rate (above the roadmap's 60-75% estimate).
The one fail was correctly caught — edge-2535's draft scenario
turned out too similar to one of its between-questions
(edge-1883), cosine 0.933 over the 0.92 threshold. This is the
gate working as designed: Gemini occasionally drafts a "bridge" that's
just a paraphrase of one of its anchors instead of a true L3
intermediate. The gate filtered it.
Phase 3.d — promotion (4 passing drafts):
- `.yaml.draft` → `.yaml` rename for the 4 passes.
- `_authoring` private metadata stripped at promotion; replaced with:
  - `provenance: llm-draft`
  - `status: draft` (not `published`; gating on human review)
  - `authors: ["gemini-3.1-pro-preview"]`
  - `human_reviewed: { status: not-reviewed, ... }` so the not-yet-reviewed state is honest and machine-checkable.
  - `tags`: original tags preserved + a new `gap-bridge:<from>-<to>` tag so these can be queried later.
- IDs appended to `id-registry.yaml`: `edge-2536`, `edge-2537`, `mobile-2146`, `mobile-2147`, with created_by `generate_question_for_gap.py`.
- `edge-2535.yaml.draft` was kept in place (still .yaml.draft). Decision for the human reviewer when they triage: rewrite + retry, or delete.
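The metadata swap at promotion can be sketched as a pure function over the draft's parsed YAML. This is an illustration, not the promotion code itself: the function name is hypothetical, `human_reviewed` carries only the `status` key because the other fields are elided in this log, and reading `<from>-<to>` in the tag as the two `between` question IDs is an assumption.

```python
import copy


def promote_draft(draft, model="gemini-3.1-pro-preview"):
    """Return a promoted copy of a parsed draft; the input is left untouched."""
    promoted = copy.deepcopy(draft)
    authoring = promoted.pop("_authoring", {})  # private block is stripped
    between = authoring.get("gap", {}).get("between", [])

    promoted["provenance"] = "llm-draft"
    promoted["status"] = "draft"  # stays unpublished until a human reviews it
    promoted["authors"] = [model]
    promoted["human_reviewed"] = {"status": "not-reviewed"}  # other keys elided in the log

    tags = list(promoted.get("tags", []))
    if len(between) == 2:
        # Assumption: <from>-<to> are the two anchor question IDs.
        tags.append(f"gap-bridge:{between[0]}-{between[1]}")
    promoted["tags"] = tags
    return promoted


# Hypothetical draft body; only the fields relevant to promotion are shown.
draft = {
    "id": "edge-2536",
    "tags": ["pruning"],
    "_authoring": {"gap": {"between": ["edge-1960", "edge-1957"]}},
}
print(promote_draft(draft))
```

Working on a deep copy keeps the `.yaml.draft` content available for comparison if the promoted file is later rejected.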
Validation post-promotion:
- `vault check --strict` → 10,705 loaded (was 10,701; +4 ✓), 0 invariant failures.
- `vault build --local-json` → released set unchanged: 9438 published, chainCount=879, releaseHash=`04ee8a23…` (drafts have status=draft, so the publishing filter excludes them, by design).
Phase 3.e — chain rebuild (deferred):
Skipped tonight. The new questions are status: draft and the
chain-builder filters on published, so a rebuild wouldn't pick them
up. The right sequence is: human reviews the 4 drafts → flips status
to published (and human_reviewed.status to verified) → then
re-runs build_chains_with_gemini.py --all. At that point chainCount
is expected to grow modestly (the 4 new questions were authored TO
fit chains, so they should land in their bridge slots).
Files changed in the Phase 3 pilot commit:
- `interviews/vault/questions/edge/cross-cutting/edge-2537.yaml` (new)
- `interviews/vault/questions/edge/optimization/edge-2536.yaml` (new)
- `interviews/vault/questions/mobile/deployment/mobile-2147.yaml` (new)
- `interviews/vault/questions/mobile/power/mobile-2146.yaml` (new)
- `interviews/vault/questions/edge/latency/edge-2535.yaml.draft` (new; failed validation, awaiting reviewer disposition)
- `interviews/vault/draft-validation-scorecard.json` (new; per-row record)
- `interviews/vault/id-registry.yaml` (4 appended entries)
- `interviews/vault-cli/docs/CHAIN_ROADMAP.md` (this entry)
Notes for next session — review checklist:
- Read each of the 4 promoted drafts. Spot-checks suggest they're competent but cognitive-load calibration is the place where Gemini drift is most likely. Each scorecard row has the `level_fit` rationale from the LLM judge; those are first-cut signals, not authoritative.
- For the failed `edge-2535`: read it next to its high-cosine neighbour (`edge-1883`). If it's too duplicative as the originality gate suggests, delete; if it's actually distinct enough, edit and re-validate (you can re-run `validate_drafts.py` after editing).
- Once you're happy with N drafts, flip their `status: draft → published` and `human_reviewed.status → verified`, set `human_reviewed.by`, then run in order:
  - `vault check --strict`
  - `vault build --local-json` (released question count goes up by N)
  - `python3 interviews/vault-cli/scripts/build_chains_with_gemini.py --all --output interviews/vault/chains.proposed.json`
  - `python3 interviews/vault-cli/scripts/apply_proposed_chains.py`
- If the pilot's 80% rate holds at scale, a 30-gap batch would land ~24 promotable drafts and absorb ~12-15 of them into chains (chain rebuild typically picks up ~50% of new questions per the roadmap).
Cost note: This pilot used 5 generation calls + 5 × 3 judge calls = 20 Gemini calls. A 30-gap batch would be ~120 calls (still under the 250/day cap but worth budgeting around).
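The cost arithmetic above is one generation call plus three judge calls per gap. A small sketch, assuming every draft reaches all three LLM judges (no pre-filter, no schema failures that would skip them); the function names are illustrative:

```python
import math

JUDGE_GATES = ("level_fit", "coherence", "bridge")  # the three Gemini-judged gates
DAILY_CAP = 250  # Gemini calls per day


def gemini_calls(n_gaps, judge_gates=len(JUDGE_GATES)):
    """One generation call plus one call per judge gate, for each gap."""
    return n_gaps * (1 + judge_gates)


def days_at_cap(calls, cap=DAILY_CAP):
    """Whole days needed to spend `calls` under a fixed daily cap."""
    return math.ceil(calls / cap)


print(gemini_calls(5))    # pilot: 5 + 5*3 = 20
print(gemini_calls(30))   # 30-gap batch: 30 + 30*3 = 120
print(days_at_cap(gemini_calls(30)))  # fits in a single day
```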
Next step: Phase 3.e — chain rebuild. Gated on human review of the 4 drafts now in the tree.
2026-05-01 — B-track autonomous queue + independent Gemini audit
B-track shipped (7 items + D cleanups):
| commit | what |
|---|---|
| `188c92b10` | B1: `promote_drafts.py`, one-command Phase 3.d helper |
| `c3a9dfff7` | B7: `PHASE_3_REVIEW_GUIDE.md`, review handoff doc |
| `476e9b146` | B2 / Phase 4.1: chain audit + per-track regression guard in `staffml-validate-vault.yml` |
| `dbd3d9458` | B6 / Phase 4.9: gitignore CI guard for question YAMLs |
| `03ea7da6b` | B5 / Phase 4.6: periodic chain rebuild workflow (workflow_dispatch only initially) |
| `46a02e890` | B3 / Phase 2.3 deferred: explore-page "Primary chains only / All" filter; playwright 19/19 |
| `09c04224f` | B4 / Phase 4.7: `check_chain_decay.py` advisory script |
| `6fb1692eb` | D-cleanups: roadmap status header + Phase 3 authoring conventions in ARCHITECTURE.md §3.6.1 |
Independent Gemini audit (5a1cb3d3b script + run results):
Ran audit_chains_with_gemini.py — 18 calls (well under the 250/day Pro
cap), each sized to ~80-336K char prompts (within the attention sweet
spot at 80-100K input tokens). Per-call traces under
interviews/vault/audit-runs/20260501T213817Z/; rollup at
interviews/vault/audit-runs/AUDIT_REPORT.md.
Three critical findings the pipeline's own gates missed:
1. Δ=0 chains are ~98% bad. Of 55 Δ=0 chains in `chains.json`, 54 were judged `bad` by the independent auditor and 54/55 were judged "shared_scenario_for_d0_pair: no". The lenient prompt was supposed to allow Δ=0 ONLY for shared-scenario pairs, but in practice the constraint didn't bind. Recommended action: demote / drop the Δ=0 chains from chains.json, OR rewrite the lenient prompt so the shared-scenario requirement is enforced more rigorously before re-running the lenient sweep on these buckets.
2. Gap detection is ~50% noise. Of 40 sampled gaps (across both strict + lenient gap files), 21 (52.5%) were judged "hallucinated": the two anchor questions don't share a scenario thread for a bridge to make sense. Recommended action: add a pre-filter to `generate_question_for_gap.py` that rejects gaps whose anchors fail a coherence check before issuing a generation call. Saves ~half the Phase 3 generation cost.
3. The pilot drafts' pass rate was inflated. Of the 4 promoted drafts (which all passed `validate_drafts.py`'s LLM judges):
   - `mobile-2147`: accept ✓
   - `edge-2536`: edit (scenario truncation)
   - `edge-2537`: reject (cognitive load too low for L3)
   - `mobile-2146`: reject (physically absurd 0.5s/4W wake-up)
   - Action: the existing 4 drafts in the corpus need human review against this audit (see `AUDIT_REPORT.md` for the specific issues per draft). Consider rewriting `validate_drafts.py`'s coherence judge to specifically look for physical-realism / fabrication failure modes.
Calibrating findings:
- Primary chains (sample of 100): 64% good, 22% weak, 14% bad. Higher "bad" rate than secondary's 6% — partly explained by primary samples skewing higher-level (where the auditor applied stricter judgement), but worth a deliberate split-by-level follow-up.
- Secondary chains (sample of 100): 61% good, 33% weak, 6% bad. The tier delta vs primary at the "good" level is small (61% vs 64%) — the secondary tier isn't dramatically worse, but the lenient sweep's Δ=0 expansion is where the actual quality cliff is.
No autonomous fixes filed — per agreement, audit produces findings only. The user reviews findings and decides which to act on.
Files changed in this entry:
- 8 commits (`188c92b10` through `6fb1692eb`) listed above
- `interviews/vault-cli/scripts/audit_chains_with_gemini.py` (new, `5a1cb3d3b`)
- `interviews/vault/audit-runs/AUDIT_REPORT.md` (new)
- `interviews/vault/audit-runs/20260501T213817Z/*.json` (per-call traces, 7 files)
- this Progress Log entry
Notes for next session — review checklist:
- Read `AUDIT_REPORT.md` start-to-finish (it's ~3KB, ~5 min).
- Decide what to do with Δ=0 chains. They're 55 of 879 chains (~6% of all chains). Three options:
  - a. Drop them entirely from `chains.json` (simplest)
  - b. Demote them to a `tier=experimental` (preserves the audit trail) and surface them only via explicit URL
  - c. Re-run the lenient sweep on the 55 source buckets with a tighter prompt (highest cost, but might keep the good ones)
- Decide what to do with the 4 pilot drafts. Two reject, one edit, one accept. Disposition them per `PHASE_3_REVIEW_GUIDE.md`'s decision tree.
- Decide what to do with `generate_question_for_gap.py`. The ~50% gap hallucination rate means Phase 3.c at scale (30 gaps) would waste ~15 generation calls + ~45 judge calls. Adding a pre-filter before scaling is high-ROI.
Cost ledger (this whole session):
- Phase 1.4 lenient sweep: 17 calls
- Phase 3 pilot generation + validation: 5 + 15 = 20 calls
- Audit run: 18 calls
- Total: 55 Gemini calls across the session
Next step: user review — three concrete decisions enumerated above. After those land, scaling Phase 3 is the natural next tooling push.
2026-05-02 — Audit finding #1 actioned: dropped Δ=0 chains
What was done:
The 2026-05-01 independent audit's strongest finding was that 54 of 55
Δ=0 chains in chains.json had no shared scenario thread (the
constraint the lenient prompt was supposed to enforce). The conclusion
on second look was that Δ=0 was a flawed design choice, not a bug to
fix:
- The chain definition is "pedagogical progression through Bloom levels"; same-level edges contradict the definition.
- The "shared scenario / different angle" carve-out was unenforceable by an LLM at scale.
- Same-scenario same-level pairs are more honestly modeled as siblings of a chain anchor than as members of a chain.
So this commit doesn't just drop the bad chains — it removes Δ=0 from the lenient mode entirely.
Changes:
- `interviews/vault/chains.json`: 879 → 824 chains. The 55 dropped chains are all `tier=secondary` (since Δ=0 was only ever produced by the lenient sweep). Per-track impact: edge -19, tinyml -12, mobile -10, cloud -7, global -7.
- `build_chains_with_gemini.py`:
  - `MODE_CONFIG["lenient"]["allowed_deltas"]`: `{0,1,2,3}` → `{1,2,3}`
  - `LENIENT_PROMPT_TEMPLATE`: Δ=0 paragraph rewritten to explicitly REJECT same-level pairs, with a one-line note about why (scenario-thread constraint didn't bind in practice).
  - docstring + `--mode` help text updated.
- `tests/test_chain_validation.py`:
  - `test_lenient_accepts_same_level_pair` flipped to `test_lenient_rejects_same_level_pair`.
  - Header docstring on test file updated to reflect the new lenient rule, with a 1-line rationale citing the audit.
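The effect of shrinking `allowed_deltas` can be illustrated with a small edge check. Only the lenient set `{1, 2, 3}` comes from this entry; the `invalid_edges` helper, the `"L3"`-style level labels as input, and representing a chain as an ordered list of levels are assumptions, not the validator's actual interface.

```python
# Only the lenient entry is documented in this log; other modes are omitted.
MODE_CONFIG = {"lenient": {"allowed_deltas": {1, 2, 3}}}


def level_number(level):
    """Convert a label such as "L3" to the integer 3."""
    return int(level.lstrip("L"))


def invalid_edges(levels, mode="lenient"):
    """Return (from_level, to_level, delta) for each edge outside the allowed set.

    `levels` is the ordered list of level labels along one chain.
    """
    allowed = MODE_CONFIG[mode]["allowed_deltas"]
    rejected = []
    for lower, higher in zip(levels, levels[1:]):
        delta = level_number(higher) - level_number(lower)
        if delta not in allowed:
            rejected.append((lower, higher, delta))
    return rejected


print(invalid_edges(["L2", "L3", "L5"]))  # deltas 1 and 2: accepted
print(invalid_edges(["L3", "L3"]))        # same-level pair: rejected
```

With 0 removed from the set, a same-level pair is rejected by the same membership test that already rejected negative or oversized jumps, so no special case is needed.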
Validation:
- `vault check --strict`: 10,705 loaded, 0 invariant failures
- `vault build --local-json`: chainCount 879 → 824, releaseHash rolls to `479811040b…` (was `04ee8a23…`)
- `pytest interviews/vault-cli/tests/`: 74/74 pass
- playwright `chain-and-vault-smoke`: 19/19 (test fixtures `cloud-0001` and `cloud-0231` are still in their respective chains post-drop, so the suite required no changes)
What this leaves on the table (audit findings #2 and #3):
- #2 — Gap detection ~50% noise. Still pending. Worth building the gap pre-filter only when scaling Phase 3, which depends on:
- #3 — 4 pilot drafts disposition. 1 accept, 1 edit, 2 reject per
the audit. Needs human review against
PHASE_3_REVIEW_GUIDE.md.
Next step: user disposition of the 4 drafts. After that,
either tighten validate_drafts.py and pre-filter gaps before
scaling Phase 3, or pause the corpus-growth track and pivot.
2026-05-02 — Phase 3 batch + corpus health survey + release audit plan
Phase 3 batch (committed in 924363e2b):
- 30 gaps fed to `generate_question_for_gap.py` with the new pre-filter
- Pre-filter caught 21 hallucinated gaps (21 of 30 = 70%; the count coincides with the 21 the audit flagged, but the rate is higher than the 52.5% the audit measured on its 40-gap sample)
- 9 drafts generated, 6 promoted (status: published, human_reviewed: verified by vj), 3 rejected (level inflation)
- 5 buckets re-chained in parallel via `--bucket` rebuilds
- Net: chainCount 824 → 843 (+19), published 9,440 → 9,446 (+6)
- Drive-by: 24 chain_ids renumbered to bucket-tagged form to resolve collisions across `--bucket` runs
Merge of origin/dev (a74c98576):
- Brought 73 commits in including dev's CI security fixes, ruff hook for vault-cli, dark-mode kit polish
- Resolved 20 conflicts (vault/* + vault-cli/scripts/* kept ours; .github/workflows/* + tinytorch/* kept theirs)
- Drive-by ruff cleanup: ~40 E701/E702/B007/E722/N806 violations across `scripts/` (dev's `b602aa961` only cleaned `src/` + `tests/`)
Corpus health survey (no Gemini cost — regex + structural):
- 9,446 published questions audited
- 90.9% pass format compliance; 9.1% fail (861 questions)
- 134 placeholder titles (`Global New 0006` etc., all in `global/`)
- 407 with `provenance: None`
- 447 napkin_math missing one or more bold markers
- 414 common_mistake missing one or more markers (164 missing both)
- 42 solutions read like rubrics
- 100% pass schema (Pydantic)
- Full data: `interviews/vault/_pipeline/format-audit-full.json` (gitignored)
Template gap discovered:
- `vault new` scaffolds only `scenario: <TODO>` and `realistic_solution: <TODO>`; it does NOT include the Pitfall/Rationale/Consequence or Assumptions/Calculations/Conclusion templates
- The format conventions are codified ONLY in:
  - The `generate_question_for_gap.py` SCHEMA_SUMMARY prompt
  - The `validate_drafts.py` `gate_format_compliance` regex
- There is no human-readable AUTHORING.md. Authors learn the format by osmosis or by reading rejected validations. New session should ship one.
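Until AUTHORING.md exists, a marker check along the following lines conveys what the format gate looks for. This is a guess at the shape, not the `gate_format_compliance` regex: the `**Marker:**` bold syntax, the pairing of Pitfall/Rationale/Consequence with `common_mistake` and Assumptions/Calculations/Conclusion with `napkin_math`, and the helper name are all assumptions.

```python
import re

# Assumed pairing of marker sets to fields; not confirmed by this log.
COMMON_MISTAKE_MARKERS = ("Pitfall", "Rationale", "Consequence")
NAPKIN_MATH_MARKERS = ("Assumptions", "Calculations", "Conclusion")


def missing_markers(text, markers):
    """Return the markers that never appear in bold in `text`.

    Accepts "**Marker:**", "**Marker**:" and "**Marker**" (assumed variants).
    """
    missing = []
    for marker in markers:
        pattern = re.compile(r"\*\*" + re.escape(marker) + r":?\*\*")
        if not pattern.search(text or ""):
            missing.append(marker)
    return missing


common_mistake = "**Pitfall:** ignoring host-side copies.\n**Rationale:** they dominate."
napkin_math = "**Assumptions:** 4 ms/frame.\n**Calculations:** 250 fps.\n**Conclusion:** ok."
print(missing_markers(common_mistake, COMMON_MISTAKE_MARKERS))
print(missing_markers(napkin_math, NAPKIN_MATH_MARKERS))
```

A field passes when the returned list is empty; returning the missing names rather than a boolean gives an author something specific to fix.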
Release audit plan (NEW — interviews/vault-cli/docs/RELEASE_AUDIT_PLAN.md):
Stratified-sample audit (1,000 questions, 33 per (track, level) cell) with
math + coherence + level_fit + bridge gates. Total: ~2,900 Gemini calls
across ~12 days at the 250/day cap. Designed for a fresh session to
execute end-to-end. Includes resume instructions, daily cost ledger
format, and post-audit cleanup → paper.tech update path.
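The sampling step of the plan can be sketched as a seeded draw of up to 33 questions per (track, level) cell. The field names `id`, `track`, `level`, the six-level assumption (L1 to L6, which gives 30 cells and 990 questions, close to the quoted 1,000), and the synthetic corpus are illustrative; RELEASE_AUDIT_PLAN.md is the authority on the actual procedure.

```python
import math
import random
from collections import defaultdict


def stratified_sample(questions, per_cell=33, seed=0):
    """Draw up to `per_cell` question IDs from each (track, level) cell.

    A fixed seed and sorted inputs make the draw reproducible, which matters
    when the audit is resumed across several days.
    """
    cells = defaultdict(list)
    for q in questions:
        cells[(q["track"], q["level"])].append(q["id"])
    rng = random.Random(seed)
    sample = {}
    for cell in sorted(cells):
        ids = sorted(cells[cell])
        sample[cell] = rng.sample(ids, min(per_cell, len(ids)))
    return sample


# Synthetic corpus: 5 tracks x 6 assumed levels x 40 questions per cell.
tracks = ["cloud", "edge", "mobile", "tinyml", "global"]
levels = [f"L{n}" for n in range(1, 7)]
corpus = [
    {"id": f"{t}-{lvl}-{i:03d}", "track": t, "level": lvl}
    for t in tracks for lvl in levels for i in range(40)
]
sample = stratified_sample(corpus)
total = sum(len(ids) for ids in sample.values())
print(len(sample), total)            # cells, sampled questions
print(math.ceil(2900 / 250))         # days for ~2,900 calls at the 250/day cap
```

Cells with fewer than 33 questions are taken whole, so a thin cell lowers the total rather than failing the draw.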
Deliberately NOT shipped this session:
- Full-corpus audit (would be ~27,400 calls / ~110 days — infeasible)
- Re-authoring `edge-2543` (content unrecoverable from disk)
- Multi-day quota-aware audit script (build it in the new session; recommended extension of `audit_math.py`)
Files committed in this entry:
- `interviews/vault-cli/docs/RELEASE_AUDIT_PLAN.md` (new)
- This Progress Log entry
Next step: new session — read RELEASE_AUDIT_PLAN.md top-to-bottom, then execute "Resume instructions for the new session" §.