68 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
713d719c3f merge origin/dev into yaml-audit
Brings in the dev-side prose / bib / math fixes that landed since the
yaml-audit branch was cut, and resolves three small conflicts:

* interviews/vault-cli/scripts/archive/split_corpus.py
    origin/dev deleted it (archive cleanup); we honor the deletion.
* interviews/vault-cli/scripts/validate_drafts.py
    origin/dev removed a leftover no-op statement; took theirs.
* interviews/vault-cli/scripts/summarize_proposed_chains.py
    origin/dev renamed loop var lvl→level; took theirs.

The two protected qmds (data_selection.qmd, model_compression.qmd)
are temp-stashed before the merge to honor the 'do not touch' rule;
restored after the merge commit lands.

After this commit, yaml-audit contains every commit on origin/dev as
an ancestor, so dev can fast-forward to yaml-audit's tip when the
maintainer is ready to merge.
2026-05-05 10:03:14 -04:00
Vijay Janapa Reddi
463a180258 fix(vault-cli): _judges adds --skip-trust to gemini invocation
The gemini CLI silently overrides --yolo to default approval mode when
its cwd is not in the trusted-folders list (e.g., a tempfile.gettempdir
scratch dir). The override is logged to stderr as 'Approval mode
overridden to "default" because the current folder is not trusted'
and the call exits 55. --skip-trust opts out of that gate. Verified
2026-05-04 in /tmp/gemini-trust-test.
2026-05-04 10:35:13 -04:00
Vijay Janapa Reddi
d53d2e4b2d fix(vault): resolve metadata gaps + promote 41 audit-clean drafts
Three gap-fixes a corpus audit on 2026-05-04 surfaced:

1. 55 cloud YAMLs were missing the status field entirely; Pydantic
   silently defaulted them to 'draft', so audit_corpus_batched skipped
   them. fix_missing_metadata.py adds explicit
   status: draft + provenance: imported.

2. 59 deleted YAMLs lacked the deletion_reason that the soft-delete
   pairing rule requires. Added placeholder text noting the original
   reason was not preserved on import.

3. The 55 newly-explicit drafts went through a focused vault audit
   (gates: format/level_fit/coherence/math/title). 41 passed all five
   gates and were promoted to status: published. The remaining 14 had
   real issues (13 level_fit / 2 coherence / 1 math) and stay drafts
   for authoring follow-up.

audit_corpus_batched.py now accepts non-published YAMLs when --qids
is explicit (the operator opted in). Default behavior (full-corpus
audit) is unchanged: published-only.

On-disk corpus now: 9,487 published (was 9,446, +41) · 423 drafts
· 386 flagged · 390 deleted · 25 archived · 0 missing-status.
vault check --strict and pytest both clean.
2026-05-04 09:06:43 -04:00
Vijay Janapa Reddi
a84cadc3b8 fix(vault): regenerate marker-compliant cm/nm for 36 published YAMLs
regenerate_format_markers.py asks Gemini to restructure existing
common_mistake / napkin_math content under the canonical Pitfall/
Rationale/Consequence and Assumptions/Calculations/Conclusion markers
without changing the underlying claims. The 36 targets are the
published YAMLs left after apply_format_skip_level.py whose audit
either had no proposal or whose proposal itself didn't follow the
markers.

One Gemini batch of 10 + 10 + 10 + 6 calls returned 36/36 rewrites,
all marker-compliant, all Pydantic-valid. Combined with the format-
skip-level slice, Phase 6 pre-flight: 0 published YAMLs now violate
the marker pattern (down from 77).
2026-05-04 08:35:18 -04:00
Vijay Janapa Reddi
6e788042ae feat(vault-cli): apply_format_skip_level + 41 marker fixes
apply_format_skip_level.py applies marker-compliant common_mistake /
napkin_math corrections for published qids whose proposed fix got
skipped during Phase 5 because the row was entangled with a level
relabel (relabel-up or chain-monotonicity-block) or a high-risk
realistic_solution rewrite. The script applies ONLY the format fields
when the current YAML's value is malformed AND the proposed value
matches the AUTHORING.md markers. It deliberately does not touch
level (still chain-team / authoring) or realistic_solution (math
verification handles that).
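
A minimal sketch of the "format fields only" rule described above, assuming questions are handled as plain dicts and that a marker counts as present when it appears as "Name:" in the canonical order. The helper names (has_markers, format_only_patch) and the regex shape are illustrative assumptions, not the script's actual code.

```python
import re

# Marker names come from the commit text; the "Name:" regex shape is an assumption.
FORMAT_FIELDS = {
    "common_mistake": ("Pitfall", "Rationale", "Consequence"),
    "napkin_math": ("Assumptions", "Calculations", "Conclusion"),
}


def has_markers(text, markers):
    """Return True when every marker appears, in order, as 'Name:'."""
    pos = 0
    for name in markers:
        match = re.search(rf"\b{name}\s*:", text[pos:])
        if match is None:
            return False
        pos += match.end()
    return True


def format_only_patch(current, proposed):
    """Collect only the format fields that are safe to copy over.

    A field is copied when the current value is malformed AND the proposed
    value follows the markers. level and realistic_solution are never touched.
    """
    patch = {}
    for field, markers in FORMAT_FIELDS.items():
        cur, new = current.get(field), proposed.get(field)
        if not cur or not new:
            continue  # field absent on either side: nothing to fix
        if not has_markers(cur, markers) and has_markers(new, markers):
            patch[field] = new
    return patch
```

Returning a patch instead of mutating the question keeps the level relabel and any realistic_solution rewrite in the proposal out of the write path entirely.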

Phase 6 pre-flight: a survey on 2026-05-04 found 77 published YAMLs
with malformed markers. This pass fixes 41 of them. Remaining 36
have no marker-compliant proposal in the audit and need a fresh
authoring round before the LinkML pattern can land cleanly.
2026-05-04 08:25:14 -04:00
Vijay Janapa Reddi
3a14b6fbb7 feat(vault-cli): apply_math_skip_level + broaden verify guard
apply_math_skip_level.py is a Phase 5 cleanup helper. For the small set
of qids whose math fix carries a level relabel that's chain-blocked or
relabel-up, the math correction is independently verified and applies
cleanly — only the level relabel is the chain-team / authoring decision.
This script applies napkin_math/realistic_solution/common_mistake while
leaving level untouched, writing a 05_math_skip_level.json sidecar.

verify_math_corrections.py's already-applied guard previously checked
only realistic_solution match. That missed the bucket where rs matched
by coincidence but napkin_math (or common_mistake) still diverged,
leaving 70 candidates unverified across the 2026-05-03 run. The guard
now considers all three math fields.
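
The broadened guard can be sketched as follows, assuming the current YAML and the proposed correction are both available as dicts. The function name and the exact-equality comparison are assumptions for illustration; the real guard may normalize whitespace or compare differently.

```python
# The three math fields named in the commit message.
MATH_FIELDS = ("napkin_math", "realistic_solution", "common_mistake")


def already_applied(current, proposed):
    """True only when every proposed math field already matches the YAML.

    Checking realistic_solution alone would return True for rows where it
    matched by coincidence while napkin_math or common_mistake still differed.
    """
    return all(
        current.get(field) == proposed[field]
        for field in MATH_FIELDS
        if field in proposed
    )
```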
2026-05-04 08:13:52 -04:00
Vijay Janapa Reddi
04c69e6a5b feat(vault-cli): verify_math_corrections.py — Phase 5 math-fix verifier
Independent Gemini verification pass for the 376 high-risk corrections
that include realistic_solution rewrites (math-driven fixes).

Process:
  1. For each row with a realistic_solution rewrite, build a payload
     with: scenario, question, original solution, proposed napkin_math,
     proposed realistic_solution.
  2. Batch ~10 per call; ask Gemini to RE-DERIVE the answer from the
     scenario as if it hadn't seen the proposed answer, then compare.
  3. Each item gets verdict: yes / no / unclear.
  4. Auto-apply ONLY 'yes' verdicts subject to:
     - Pydantic validation (must pass before write)
     - Chain monotonicity check (level relabels can't break chains)
     - Relabel-up policy (relabel-down only)

Verification prompt explicitly instructs Gemini to default to "unclear"
when uncertain — strict bar for auto-apply.

Outputs:
  03_math_verification.json   per-qid verdict + rationale
  04_math_applied.json        per-qid apply result

Note: forced past .gitignore's `**/VERIFY_*.py` rule (case-insensitive
match on macOS). The rule was for legacy LLM-generated scratch files;
this is intentional production tooling.

CORPUS_HARDENING_PLAN.md Phase 5 — math-fix verification leg.
2026-05-03 19:08:48 -04:00
Vijay Janapa Reddi
15811ef4bc feat(vault-cli): mass_apply_corrections.py — Phase 5 low-risk auto-applier
Automates the safe subset of Phase 5 review work. Reads a 01_audit.json
from a --propose-fixes run and auto-applies LOW-risk corrections
without prompting. HIGH-risk corrections (anything rewriting
realistic_solution) are skipped — those need separate math verification.

Risk classification:
  LOW  : correction's fields ⊆ {title, level, common_mistake, napkin_math}
  HIGH : any correction including realistic_solution
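
The classification above amounts to a set check. This is a sketch under the assumption that a correction is a dict keyed by field name; the UNKNOWN bucket for fields outside both rules is an assumption added here, since the commit text only defines LOW and HIGH.

```python
LOW_RISK_FIELDS = frozenset({"title", "level", "common_mistake", "napkin_math"})


def classify_risk(correction):
    """Classify a proposed correction by the fields it rewrites."""
    fields = set(correction)
    if "realistic_solution" in fields:
        return "HIGH"  # needs separate math verification
    if fields and fields <= LOW_RISK_FIELDS:
        return "LOW"  # eligible for auto-apply
    # Assumption: anything else (empty, or an unlisted field) is not auto-applied.
    return "UNKNOWN"
```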

Defensive checks for level relabels (caught real bugs during 2026-05-03
smoke):
  1. Relabel-UP blocked — policy is relabel-down only (§10 Q3).
     Gemini will sometimes propose L3→L4 even with the prompt asking
     for down; we filter regardless.
  2. Chain-monotonicity check — chains.json requires non-decreasing
     levels along chain positions. A relabel that drops a member
     below its predecessor breaks the chain. The check overlays
     prior applies in the same run so cascading same-chain relabels
     don't slip through.
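
The two checks above can be sketched as one function. The sketch assumes levels are stored as strings like "L3", that a chain is an ordered list of qids, and that earlier applies from the same run are tracked in a dict. All names here are hypothetical, not the script's actual API.

```python
def level_num(level):
    """Convert a label such as 'L3' to the integer 3 (label format assumed)."""
    return int(level.lstrip("L"))


def relabel_allowed(qid, new_level, chain, levels, applied):
    """Decide whether a proposed relabel may be applied.

    chain   : ordered list of qids along the chain
    levels  : qid -> level currently on disk
    applied : qid -> level already applied earlier in this run (the overlay)
    """
    effective = {**levels, **applied}
    if level_num(new_level) > level_num(effective[qid]):
        return False, "relabel-up blocked"
    effective[qid] = new_level
    nums = [level_num(effective[member]) for member in chain]
    # Levels must be non-decreasing along chain positions.
    if any(a > b for a, b in zip(nums, nums[1:])):
        return False, "chain-monotonicity block"
    return True, "ok"
```

Building the effective view from disk state plus the overlay means each decision is checked against the chain as it will look after the earlier applies, not as it looked when the run started.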

Pydantic validation runs BEFORE writing each YAML; failures don't
write. Atomic temp+rename keeps state consistent under interruption.
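
A sketch of the validate-then-atomic-write pattern, using only the standard library. The validate callback stands in for the Pydantic check and is an assumption; any callable that raises on bad input fits.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text, validate):
    """Validate first, then write via a temp file and an atomic rename."""
    validate(text)  # raises before anything touches the disk
    path = Path(path)
    # Create the temp file in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```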

Outputs disposition sidecar at <run-dir>/02_mass_apply.json with
per-qid result + reason.

Used to apply 2,075 of 2,381 low-risk corrections from the
2026-05-03 audit dataset (138 chain-monotonicity blocks, 168
relabel-up blocks). 0 Pydantic failures.

CORPUS_HARDENING_PLAN.md Phase 5 — low-risk leg.
2026-05-03 19:06:32 -04:00
Vijay Janapa Reddi
d2621cc9ed feat(vault-cli): merge_audit_runs.py + Phase 4 findings doc
merge_audit_runs.py — merges multiple per-track audit_corpus_batched
output dirs into one canonical run. Per-qid prefer non-error rows,
then rows with suggested_corrections.
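
The per-qid preference can be expressed as a sort key. This sketch assumes each row is a dict with a "qid" key and that error rows are marked by format_compliance == "error", as in the resume fix elsewhere in this log; the function names are illustrative.

```python
def row_rank(row):
    """Higher is better: non-error rows first, then rows with corrections."""
    is_ok = row.get("format_compliance") != "error"
    has_fixes = bool(row.get("suggested_corrections"))
    return (is_ok, has_fixes)


def merge_runs(runs):
    """Merge several runs (each a list of rows) into one row per qid."""
    best = {}
    for rows in runs:
        for row in rows:
            qid = row["qid"]
            # Strict '>' keeps the earliest run's row on ties.
            if qid not in best or row_rank(row) > row_rank(best[qid]):
                best[qid] = row
    return best
```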

AUDIT_FINDINGS_2026-05-03.md — first complete corpus audit.

summarize_audit.py — truncate rationale snippets at word boundaries
(was truncating mid-word, tripping codespell on words like 'claimin').
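
A minimal sketch of word-boundary truncation. The function name, the limit parameter, and the "..." suffix are assumptions for illustration.

```python
def truncate_at_word(text, limit, ellipsis="..."):
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the next character is not whitespace, the cut landed mid-word:
    # back up to the previous space when there is one.
    if not text[limit].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip() + ellipsis
```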

Phase 4 final stats (9,446 published questions audited):
  format_compliance:   ~960 fail
  level_fit:          ~1,580 fail
  coherence:            ~480 fail
  math_correct:         ~330 fail
  title_quality:        ~250 placeholder + ~25 malformed
  20 error rows in global to retry on next run

1,767 questions have suggested_corrections; ~1,500 more need a
propose-fixes backfill pass (mostly cloud, some edge).

CORPUS_HARDENING_PLAN.md Phase 4 finalization.
2026-05-03 14:26:37 -04:00
Vijay Janapa Reddi
2d9330da67 fix(vault-cli): isolate gemini CLI scratch files in temp dir
The gemini CLI in --yolo mode occasionally writes scratch files
(prompt_candidates.json, audit.py, evaluate_*.py, partial JSON outputs)
to its CWD. When invoked from the repo root those landed alongside the
worktree and polluted git status with ~30 untracked files.

Fix: pass cwd=tempfile.gettempdir()/vault_audit_gemini_scratch to
subprocess.run. The scratch dir is created lazily on import.

This doesn't affect Gemini's output (we capture stdout) or the
prompt (we pass via -p). It just keeps the gemini CLI's incidental
file-system side effects out of the worktree.
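
The isolation can be sketched with subprocess.run and a scratch directory under the system temp dir. The wrapper name is hypothetical, and this version creates the directory on first call rather than on import.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

SCRATCH_DIR = Path(tempfile.gettempdir()) / "vault_audit_gemini_scratch"


def run_in_scratch(argv, timeout=600):
    """Run a child process with its cwd pointed at the scratch dir.

    Stray files the child writes to its cwd land in SCRATCH_DIR instead of
    the worktree; stdout is still captured for the caller.
    """
    SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
    return subprocess.run(
        argv, cwd=SCRATCH_DIR, capture_output=True, text=True, timeout=timeout
    )
```

For example, run_in_scratch([sys.executable, "-c", "open('leftover.json', 'w').close()"]) leaves leftover.json in the scratch directory, not in the directory the caller started from.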

CORPUS_HARDENING_PLAN.md Phase 3 (delayed reliability fix).
2026-05-03 11:08:53 -04:00
Vijay Janapa Reddi
3eaac3ca93 feat(vault-cli): summarize_audit.py — Phase 4 finalization helper
Reads a 01_audit.json and emits a markdown triage doc with:
  - executive summary (per-gate pass/fail/error counts)
  - per-track failure rate matrix
  - coherence failure-mode breakdown (vendor fabrication / physical
    absurdity / mismatch / arithmetic)
  - priority lists for human review (math errors highest, then
    coherence by mode, then level inflation, then placeholder titles)
  - regex-vs-Gemini format-disagreement audit
  - recommended next-step actions per category

Usage:
  python3 interviews/vault-cli/scripts/summarize_audit.py \
      --input <01_audit.json> \
      --output interviews/vault-cli/docs/AUDIT_FINDINGS_<date>.md

Smoke-tested on the in-flight Phase 4 audit (2,880 cloud rows so far).
Sample findings on the partial data:
  - 95 math errors caught (real ones — e.g. "200 * 4 * 168 = 134,400,
    not $168k/week")
  - 385 level_fit fails (13.4%, slightly lower than global's 15.3%)
  - 146 coherence fails: 61 mismatch / 47 arithmetic / 37 physical
    absurdity / 1 vendor fabrication
  - 118 placeholder titles in cloud (Gemini flags more than just
    the "Global New NNNN" pattern; Phase 7 scope is bigger than the
    plan estimated)

CORPUS_HARDENING_PLAN.md Phase 4.
2026-05-03 11:06:26 -04:00
Vijay Janapa Reddi
1722133faa feat(vault-cli): apply_corrections.py — interactive accept/reject for Gemini-proposed fixes
Phase 5's interactive review tool. Reads a 01_audit.json from a
--propose-fixes run, walks rows with non-empty suggested_corrections,
shows a unified-diff per modified field, and prompts accept/reject/
edit/skip/quit. Validates every accepted body against Pydantic before
writing.

Per CORPUS_HARDENING_PLAN.md correction policy:
  - math errors: rewrite napkin_math AND realistic_solution as a unit
  - level inflation: relabel DOWN, never rewrite up to match
  - format markers: add markers without changing prose semantics

Resumable: dispositions persist to 02_dispositions.json after each
decision; re-running skips already-decided qids. --auto-accept-format
auto-accepts format-marker-only fixes (lower-risk).

Smoke-tested against the in-flight Phase 4 audit: 0 candidates (no
--propose-fixes data yet) and exits clean.

CORPUS_HARDENING_PLAN.md Phase 5.
2026-05-03 11:03:53 -04:00
Vijay Janapa Reddi
1b58a9c508 feat(vault-cli): parallel audit_corpus_batched.py with submit-stagger
Adds ThreadPoolExecutor parallelism to the audit run loop. Without it,
a 9,446-question corpus audit would take ~14h sequential at the
canary-measured ~167s/call rate. With 4-way parallelism + 1s submit
stagger, the same audit fits in ~3-4h.

CLI:
  --workers N             concurrent Gemini calls (default 4, max 8)
  --submit-stagger SECS   sleep between batch submissions (default 1.0)

The submit stagger spreads the worker start times so all N workers
don't slam Gemini in the same instant — correlated rate-limit hits
were a concern and the stagger costs only N seconds at startup.

Concurrency safety:
  - State (rows + seen_qids + persistent file) lives behind _state_lock.
    Mutations + atomic temp+rename writes happen inside the lock.
  - Gemini subprocess calls run OUTSIDE the lock so workers don't block
    each other on the slow path.
  - _print_lock keeps stdout/stderr legible across workers (no
    interleaved lines).
  - normalize_response now drops Gemini-hallucinated qids (returned but
    not in the batch) and warns to stderr.
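
A simplified sketch of the pattern described above: staggered submission, the slow call outside the lock, and a short locked section for shared state. The judge callable stands in for the Gemini subprocess call, and the row shape (dicts with a "qid" key) is an assumption.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_state_lock = threading.Lock()


def run_batches(batches, judge, workers=4, submit_stagger=1.0):
    """Judge batches concurrently and collect verdict rows."""
    rows = []

    def work(batch):
        wanted = {item["qid"] for item in batch}
        verdicts = judge(batch)  # slow path: runs OUTSIDE the lock
        # Drop verdicts for qids that were never in this batch.
        kept = [v for v in verdicts if v.get("qid") in wanted]
        with _state_lock:  # short critical section for shared state
            rows.extend(kept)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for index, batch in enumerate(batches):
            if index:
                time.sleep(submit_stagger)  # spread out worker start times
            futures.append(pool.submit(work, batch))
        for future in futures:
            future.result()  # re-raise any worker exception
    return rows
```

Persisting the state file would go inside the same locked section, so that a write never observes a half-updated row list.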

Validation: smoke-tested on edge track with --max-calls 4 --workers 4.
All 4 batches started in the first 3 seconds (1s stagger ×3); all
finished within 290s (vs ~683s expected sequentially: a 2.35× speedup
against an ideal 4× ceiling). 0 errors, no JSON corruption from
concurrent writes.

The smoke-test results gave us the first edge-track Phase 4 signal:
22.5% level_fit fail rate (vs global's 15.3% — edge has higher
level-inflation than global, worth tracking through Phase 5).

CORPUS_HARDENING_PLAN.md Phase 4.
2026-05-03 09:41:45 -04:00
Vijay Janapa Reddi
12032f700c fix(vault-cli): audit_corpus_batched.py reliability fixes from canary
Three bugs surfaced by the global-track canary run (2026-05-03,
20260503T123116Z), all fixed:

1. Gemini-CLI subprocess timeout was 240s; the canary's average call
   took ~167s, but calls with 72K-char prompts occasionally exceeded 240s
   and were killed mid-call. 60 questions (2 batches) returned no Gemini
   response. Bumped default timeout in _judges.call_gemini_judge()
   to 600s (≈3× typical, still triggers fast on a stuck call).

2. Resume logic in run_audit() treated ANY persisted row as "audited,"
   including the placeholder rows for batches that errored. That meant
   re-running on the same output dir would skip the failed batches
   forever. Fixed: only rows with format_compliance != "error" are
   added to seen_qids, so a re-run retries the failures.

3. --output passed as a relative path crashed on
   `outdir.relative_to(REPO_ROOT)` because relative paths don't share
   the absolute REPO_ROOT prefix. Fixed: resolve outdir to absolute
   immediately after computing it.
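
The resume fix in item 2 comes down to which persisted rows count as done. A sketch, assuming rows are dicts with a "qid" key; the helper names are illustrative.

```python
def seen_qids_for_resume(rows):
    """qids that count as audited: placeholder error rows are excluded."""
    return {row["qid"] for row in rows if row.get("format_compliance") != "error"}


def pending_qids(candidates, rows):
    """Candidates still to audit, in their original order."""
    seen = seen_qids_for_resume(rows)
    return [qid for qid in candidates if qid not in seen]
```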

Validation: re-ran the canary on the same output dir with all three
fixes. Resume correctly skipped the 9 good batches, retried the 2
errored batches, and both completed cleanly in 785s. All 313 global
questions now have real Gemini verdicts (0 errors).

Canary findings:
  format_compliance: 21 fails, 99.6% Gemini-vs-regex agreement
  level_fit:         48 fails (15.3% — the predicted level-inflation
                                pattern; flagged for Phase 5 review)
  coherence:         18 fails
  math_correct:      8 fails
  title_quality:     16 placeholders (matches regex 1:1)

CORPUS_HARDENING_PLAN.md Phase 4 (canary leg).
2026-05-03 09:18:30 -04:00
Vijay Janapa Reddi
69cf6f0a5f feat(vault-cli): audit_corpus_batched.py — full-corpus batched audit
Replaces the dead-end audit_corpus.py (deleted in Phase 0). The new
design batches 30-40 questions per Gemini call instead of 1 question
per gate, dropping the corpus-audit cost by ~10×.

Per call, ONE prompt asks Gemini for a JSON array of per-question
verdicts across:
  - format_compliance:    pass/fail (regex-checkable; cross-checked
                          against host-side gate_format)
  - level_fit:            pass/fail/skip + rationale (level inflation
                          + verb mismatch + "no real judgement required")
  - coherence:            pass/fail + failure_mode (physical_absurdity /
                          vendor_fabrication / mismatch / arithmetic)
  - math_correct:         pass/fail/no_math + specific errors
  - title_quality:        good/placeholder/malformed

Cost (full corpus, 9,446 published):
  - audit-only:        ~315 calls (1.3 days at the 250/day cap)
  - --propose-fixes:   ~+50% (denser per-batch output → smaller batches)

Modes:
  --all                  full corpus (default)
  --tracks cloud,edge    track filter
  --qids X,Y,Z           explicit qid set
  --propose-fixes        ALSO ask Gemini to propose corrections
                         (per CORPUS_HARDENING_PLAN.md §10:
                          - math errors: rewrite napkin_math AND
                            realistic_solution as a UNIT
                          - level inflation: relabel DOWN, never
                            attempt to rewrite the question up)
  --max-calls N          cap per invocation; resume by re-running
  --batch-size N         tuning override
  --dry-run              plan without calling Gemini

Output convention: _pipeline/runs/<UTC-timestamp>/
  00_config.json    — flags, model, candidate count
  01_audit.json     — per-question rows (resumable; rewritten after
                      each batch so a Ctrl-C / timeout doesn't lose work)

Sanity check: dry-run on full corpus packs 9,446 questions into 315
batches of 30, with payloads 55-69KB each (well under the 320KB
attention sweet spot for gemini-3.1-pro-preview).

CORPUS_HARDENING_PLAN.md Phase 3.
2026-05-03 08:22:58 -04:00
Vijay Janapa Reddi
dd71c66cae feat(vault-cli): _judges.py + _batching.py — shared infra for batched audit
Two new helper modules under interviews/vault-cli/scripts/. Used by the
upcoming audit_corpus_batched.py (CORPUS_HARDENING_PLAN.md Phase 3) and
extractable from the existing single-call scripts in a follow-up.

_judges.py exports:
  - GEMINI_MODEL                (pinned)
  - COMMON_MISTAKE_MARKERS      (Pitfall/Rationale/Consequence)
  - NAPKIN_MATH_MARKERS         (Assumptions/Calculations/Conclusion)
  - FAILURE_MODE_TAXONOMY       (4-mode prose block: physical absurdity,
                                 vendor fabrication, mismatch, arithmetic)
  - call_gemini_judge()         (subprocess wrapper + lenient JSON parse)
  - strip_fences()              (response cleanup)
  - gate_format()               (regex format-compliance gate, free)

The taxonomy is the same prose block currently inlined in
validate_drafts.py's COHERENCE_PROMPT and audit_chains_with_gemini.py's
audit prompts. Centralizing it means a future failure-mode addition
flows to every judge, not just one script.

_batching.py exports:
  - MAX_PROMPT_CHARS = 320_000  (≈80K tokens, attention sweet spot)
  - DEFAULT_WRAPPER_CHARS       (4K headroom for prompt scaffolding)
  - pack_batches[T]()           (generic char-budgeted batcher with
                                 optional hard item cap)

Generalized from audit_chains_with_gemini.py:batch_chains and
build_chains_with_gemini.py:plan_batches. Properties documented in the
docstring (preserves order, no items lost, oversized items still land
in a batch).
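
A sketch of a char-budgeted batcher with the listed properties (order preserved, no items lost, oversized items still land in a batch). The signature is an assumption and may not match the real pack_batches; the 4_000 wrapper figure is a guess at what "4K" means.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")

MAX_PROMPT_CHARS = 320_000
DEFAULT_WRAPPER_CHARS = 4_000  # assumption: "4K" taken as 4,000 characters


def pack_batches(
    items: Iterable[T],
    size_of: Callable[[T], int],
    max_chars: int = MAX_PROMPT_CHARS,
    wrapper_chars: int = DEFAULT_WRAPPER_CHARS,
    max_items: Optional[int] = None,
) -> List[List[T]]:
    """Greedily pack items into batches under a character budget."""
    budget = max_chars - wrapper_chars
    batches: List[List[T]] = []
    current: List[T] = []
    used = 0
    for item in items:
        cost = size_of(item)
        over_budget = used + cost > budget
        at_cap = max_items is not None and len(current) >= max_items
        # Only flush a non-empty batch, so an oversized item still gets
        # a batch of its own instead of being dropped.
        if current and (over_budget or at_cap):
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches
```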

Followups:
- migrate validate_drafts.py and audit_chains_with_gemini.py to use
  _judges.call_gemini_judge instead of their inlined wrappers (out of
  scope here; non-blocking for the audit work).

CORPUS_HARDENING_PLAN.md Phase 3.
2026-05-03 08:22:39 -04:00
Vijay Janapa Reddi
39d567f267 feat(vault-cli): backfill_provenance.py — Phase 1 helper
Walks vault/questions/**/*.yaml, finds published YAMLs with no top-level
provenance line, and inserts `provenance: imported` on the line
immediately after `status: published`. Idempotent — re-running is a
no-op once the field is present. Limits scope to status: published; the
mechanical pass should not overwrite the semantics of draft / flagged /
deleted / archived questions.

CLI:
  --dry-run     report what would change
  --limit N     cap modifications (smoke test)
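
The insertion rule can be sketched on a single file's text without parsing YAML. This assumes top-level keys start at column 0 and that the status line reads exactly "status: published"; the function name is hypothetical.

```python
def backfill_provenance(text):
    """Return (new_text, changed) for one YAML file's contents."""
    lines = text.splitlines(keepends=True)
    # Idempotent: a top-level provenance key means there is nothing to do.
    if any(line.startswith("provenance:") for line in lines):
        return text, False
    for index, line in enumerate(lines):
        if line.rstrip() == "status: published":
            if not line.endswith("\n"):
                lines[index] = line + "\n"
            lines.insert(index + 1, "provenance: imported\n")
            return "".join(lines), True
    # draft / flagged / deleted / archived files are left untouched.
    return text, False
```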

CORPUS_HARDENING_PLAN.md Phase 1.
2026-05-03 08:06:12 -04:00
Vijay Janapa Reddi
a74c98576e Merge origin/dev into yaml-audit
Sync the yaml-audit branch with the latest dev work since the previous
sync (5c5af75ed). Brings in 73 commits including:

  - CI security fixes: postcss XSS bump, uuid bounds bump, codeql
    paths-ignore for vendored bundles, read-only token on
    staffml-validate-vault workflow
  - kits/ dark mode polish: code-block readability, dropdown contrast
  - vault-cli/: pre-commit ruff hook + 20 ruff fixes, all-contributors
    auto-credit workflow change to pull_request_target
  - dev's earlier merge of yaml-audit (836d481b5) carrying the
    pre-trailer-strip Phase 1/2/3 history; this merge harmonises that
    with the current trailer-clean yaml-audit tip
  - misc bug fixes (tinytorch perceptron seed, infra workflows,
    socratiq vite dev injector)

Conflicts resolved (if any) preserve the yaml-audit-side authoritative
state for vault/* files (we own those) and the dev-side authoritative
state for .github/workflows/* and other shared infrastructure.

# Conflicts:
#	.github/workflows/all-contributors-auto-credit.yml
#	.github/workflows/staffml-preview-dev.yml
#	interviews/staffml/src/data/corpus-summary.json
#	interviews/staffml/src/data/vault-manifest.json
#	interviews/staffml/tests/chain-and-vault-smoke.mjs
#	interviews/vault-cli/README.md
#	interviews/vault-cli/docs/CHAIN_ROADMAP.md
#	interviews/vault-cli/scripts/build_chains_with_gemini.py
#	interviews/vault-cli/scripts/generate_question_for_gap.py
#	interviews/vault-cli/scripts/merge_chain_passes.py
#	interviews/vault-cli/scripts/validate_drafts.py
#	interviews/vault-cli/src/vault_cli/legacy_export.py
#	interviews/vault-cli/tests/test_chain_validation.py
#	interviews/vault/.gitignore
#	interviews/vault/ARCHITECTURE.md
#	interviews/vault/chains.json
#	interviews/vault/id-registry.yaml
#	interviews/vault/questions/edge/optimization/edge-2536.yaml
#	interviews/vault/questions/mobile/deployment/mobile-2147.yaml
#	tinytorch/src/03_layers/03_layers.py
2026-05-02 11:06:43 -04:00
Vijay Janapa Reddi
615d3484ad fix(vault-cli): audit_math.py — handle output path outside REPO_ROOT
The "wrote {path}" line at end-of-run called Path.relative_to(REPO_ROOT)
unconditionally, which raised when --output was set to a /tmp/ path
(e.g., during smoke-testing). Same fix as validate_drafts.py earlier:
fall back to displaying the absolute path when relative_to fails.
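
The fallback can be sketched with pathlib: Path.relative_to raises ValueError when the path is not under the given root. The helper name is an assumption.

```python
from pathlib import Path


def display_path(path, repo_root):
    """Show a repo-relative path when possible, else the absolute path."""
    resolved = Path(path).resolve()
    try:
        return str(resolved.relative_to(Path(repo_root).resolve()))
    except ValueError:
        # Output lives outside the repo (for example under /tmp).
        return str(resolved)
```

Resolving both sides first also covers the case where --output is passed as a relative path.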

Surfaced while smoke-testing audit_math.py with --output /tmp/...
before pointing it at the real _pipeline/ destination.
2026-05-02 10:53:39 -04:00
Vijay Janapa Reddi
825d9571a6 chore: remove archived content and refresh contributor docs
- Remove retired _archive/ and scripts/archive/ trees (site, book filters, games, vault); vault CHANGELOG points to git history for old scripts.
- CONTRIBUTING: site project row, site/ in area map, root vs TinyTorch pre-commit, vault schema drift wording.
- Newsletter CLI: path-agnostic news alias; tinytorch pre-commit comments; add tools/ and staffml-vault-types READMEs for maintainers.
2026-05-02 10:48:00 -04:00
Vijay Janapa Reddi
cd37a5290c feat(vault-cli): format compliance gate + audit_math.py verifier
Two additions to the Phase 3 verification stack:

1. validate_drafts.py: new gate_format_compliance (Gate 1.5).
   Cheap regex check — no Gemini call. Verifies that the prose-block
   conventions our schema doesn't enforce are present:
     - common_mistake (when present): Pitfall / Rationale / Consequence
     - napkin_math (when present):    Assumptions / Calculations / Conclusion
   Either field is optional in the schema; the gate only flags
   present-but-malformed cases. Smoke-tested against 5 cases (clean,
   missing-pitfall, missing-calculations, no-fields, optional-absent).

2. New scripts/audit_math.py: standalone, focused math verifier.
   For each question, runs ONE Gemini call to re-derive every
   napkin_math calculation from scratch and compare against what's
   written. Returns a verdict on:
     - arithmetic_correct
     - unit_conversions_correct
     - conclusion_follows
     - errors[] (specific issues with quoted lines)
   Use cases: pre-promotion gate on Phase 3 drafts, retroactive
   audit of any subset of the published corpus.
   Internal parallelism via ThreadPoolExecutor (default 4 workers,
   capped at 8 to stay under typical Gemini RPM limits). Modes:
   --drafts-only, --files <paths...>, --sample-track + --sample-size.
2026-05-02 10:10:08 -04:00
Vijay Janapa Reddi
64d546de55 feat(vault-cli): tighten validate_drafts coherence + level_fit gates
The 2026-05-02 audit caught failure modes the existing
validate_drafts.py judges let through: 2 of the 4 drafts that passed
all 4 gates (mobile-2146 physical absurdity, edge-2537 cognitive-load
inflation) were rejected by the independent audit. This commit
tightens the coherence and level_fit prompts to catch those modes
explicitly.

gate_coherence — explicit failure-mode taxonomy:
  1. PHYSICAL ABSURDITY: numbers violating real-world hardware bounds
     (NPU wake-up >50ms, off-class power figures, latency >5× off for
     named hardware, duty-cycling that defeats the use-case).
  2. VENDOR FABRICATION: invented hardware / framework / benchmark
     names. Conservative — only flag clearly invented, not plausible-
     but-unverified.
  3. SCENARIO/Q/SOLUTION MISMATCH: question doesn't follow scenario;
     solution doesn't answer the question; cross-field number
     contradictions.
  4. ARITHMETIC ERRORS in napkin_math.
  Output now includes a "failure_mode" field for the rationale to
  hang on; the verdict is unchanged in shape ("yes"|"no").

gate_level_fit — explicit "level inflation" check:
  - L3+ stamped on a question that's actually L1/L2 (recall + simple
    multiplication with all inputs given) → reject.
  - Verb mismatch (the question's verb is more than 1 Bloom step from
    the level field's expected verb) → reject.
  - L4+ requires real decomposition / root-cause / trade-off; mechanical
    computation with all inputs provided is not L4.

Re-validation against the original Phase 3 pilot drafts (5 drafts × 3
judge calls = 15 Gemini calls):

  mobile-2147   accept  → pass on all 4    ✓ (matches audit "accept")
  edge-2536     accept  → pass on all 4    ✓ (matches audit "edit-then-publish";
                                              80ms→15ms latency edit shipped earlier)
  edge-2537     reject  → fail level_fit   ✓ ("level inflation: simple
                                              arithmetic with all inputs upfront")
  mobile-2146   reject  → fail level_fit   ✓ ("0.5s NPU wake-up physically
                              + coherence    absurd; dashcam idle 75% would
                                              miss accidents")
  edge-2535     reject  → fail originality ✓ (cos=0.933 vs edge-1883;
                              + coherence    coherence now ALSO catches:
                                              "solution doesn't actually
                                              perform the calculation")

100% agreement with the independent audit. No false positives on the
legitimate drafts.

Cost: 15 Gemini calls for the re-validation. Going forward, each
draft eats 3 judge calls (level_fit + coherence + bridge) — same as
before; the prompts are bigger but the call count is unchanged.
2026-05-02 09:56:16 -04:00
Vijay Janapa Reddi
b84691e440 feat(vault-cli): generate_question_for_gap pre-filter for hallucinated gaps
The 2026-05-02 audit found ~70% of detected chain gaps are
hallucinated — the two anchor questions don't share a scenario
thread, so a "bridge" between them is fictional. Without this gate,
generating from the existing 407-gap backlog would waste ~75% of the
budget (1 generation call + 3 downstream-judge calls per bad gap).

Adds a 1-call pre-filter via call_gemini_prefilter. The judge sees the
gap entry plus the two anchors in full and returns:

  {
    "verdict": "real" | "hallucinated",
    "anchors_share_scenario": "yes" | "no",
    "level_makes_sense": "yes" | "no",
    "rationale": "<one sentence>"
  }

Hallucinated → process_gap returns ok=False with the prefilter
verdict captured for review. Real → falls through to generation
(unchanged downstream behaviour).

Cost analysis at 70% hallucination rate, 30-gap batch:
  Before: 30 generations + 90 judge calls = 120 calls; ~7 promotable drafts
  After:  30 prefilter + ~9 generations + 27 judge calls = 66 calls;
          ~7 promotable drafts (same yield, half the cost)
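
The call counts above follow from simple arithmetic. A sketch, assuming one generation call plus three judge calls per generated gap and one pre-filter call per gap; the function names are illustrative.

```python
def calls_without_prefilter(gaps, judges_per_draft=3):
    """Every gap costs one generation call plus its judge calls."""
    return gaps * (1 + judges_per_draft)


def calls_with_prefilter(gaps, hallucination_rate, judges_per_draft=3):
    """Every gap costs one pre-filter call; only real gaps go further."""
    real_gaps = round(gaps * (1 - hallucination_rate))
    return gaps + real_gaps * (1 + judges_per_draft)
```

With 30 gaps and a 0.7 hallucination rate these give 120 and 66 calls respectively, matching the figures in the analysis.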

Skip the pre-filter with --skip-prefilter when re-validating an
already-filtered gap list or for cost-debugging. Default is filter ON.

Smoke checks (mock prefilter responses):
  - "real" → process_gap returns ok=True, falls through to generation
  - "hallucinated" → ok=False, why="pre-filter: hallucinated gap (...)"
  - --skip-prefilter → no pre-filter call, dry_run shows the prompt
2026-05-02 09:49:48 -04:00
Vijay Janapa Reddi
5225059754 fix(vault-cli): clear ruff violations flagged by --all-files sweep
Auto-fix removed extraneous f-string prefixes, unused imports
(re, sys, textwrap, defaultdict), an unused local (qids), and
converted datetime.now(timezone.utc) to datetime.now(UTC) (UP017).
Manual fixes split colon/semicolon one-liners onto separate lines
(E701/E702), renamed unused loop vars (cid, chain_id) with leading
underscores (B007), replaced bare except with except Exception (E722),
and renamed loop var L to level to satisfy N806.
2026-05-02 09:17:15 -04:00
Vijay Janapa Reddi
2b3cf5e1da chore(vault): consolidate AI pipeline artifacts under _pipeline/
Establishes one ignored subdirectory for ALL intermediate outputs of
LLM-driven tooling (chain proposals, gap detection, draft scorecards,
audit traces). Single gitignore rule: /_pipeline/.

Convention is documented in interviews/vault/README.md under "Pipeline
artifacts" — it's a real project layout convention, not AI-specific
config.

Path migration:
  interviews/vault/chains.proposed*.json
                  → _pipeline/chains.proposed*.json
  interviews/vault/gaps.proposed*.json
                  → _pipeline/gaps.proposed*.json
  interviews/vault/draft-validation-scorecard.json
                  → _pipeline/draft-validation-scorecard.json
  interviews/vault/audit-runs/
                  → _pipeline/runs/

8 scripts updated to define a PIPELINE_DIR constant and route default
outputs through it: build_chains_with_gemini.py,
apply_proposed_chains.py, merge_chain_passes.py, validate_drafts.py,
audit_chains_with_gemini.py, generate_question_for_gap.py,
summarize_proposed_chains.py, promote_drafts.py.

Forward-looking docs (README.md chain-pipeline section + CHAIN_ROADMAP.md
resume instructions + state snapshot) updated to reference the new
paths. Historical Progress Log entries left as-is — they accurately
describe what was committed at the time.

Drive-by .gitignore fixes (both used full repo-relative paths under
package-local .gitignore files, which never matched):
  interviews/vault-cli/.gitignore: scripts/.calibration_cache/
  interviews/vault/.gitignore:     /embeddings.npz

Validation:
  - vault check --strict: 10,705 loaded, 0 invariant failures
  - pytest interviews/vault-cli/tests/: 74/74
  - audit --dry-run: paths resolve correctly to _pipeline/runs/<ts>/

No durable corpus content moves. chains.json (live registry),
id-registry.yaml, questions/, etc. all stay where they were.
2026-05-02 09:04:55 -04:00
Vijay Janapa Reddi
270b1a5bd2 fix(vault): drop 55 Δ=0 chains + remove Δ=0 from lenient mode
Action on the strongest finding from the 2026-05-01 independent audit:
54 of 55 Δ=0 chains had no shared scenario (the "two questions
sharing a scenario thread" constraint the lenient prompt was supposed
to enforce). Two independent audit fields agreed (verdict=bad and
shared_scenario=no), so this isn't a tuning question — the design
choice was wrong.

Why remove Δ=0 entirely rather than tighten the prompt:

  - The chain definition is "pedagogical progression through Bloom
    levels"; same-level edges contradict the definition.
  - The "shared scenario / different angle" carve-out is unenforceable
    by an LLM at corpus scale (audit confirmed).
  - Same-scenario same-level pairs are more honestly modeled as
    siblings of a chain anchor, not as chain members.

Changes:
  - chains.json: 879 → 824. Dropped: 55 chains (all tier=secondary,
    since Δ=0 was only ever produced by the lenient sweep).
    Per-track: edge -19, tinyml -12, mobile -10, cloud -7, global -7.
  - build_chains_with_gemini.py:
      MODE_CONFIG["lenient"]["allowed_deltas"]: {0,1,2,3} → {1,2,3}
      LENIENT_PROMPT_TEMPLATE: Δ=0 paragraph rewritten to explicitly
        REJECT same-level pairs (with rationale citing the audit).
      docstring + --mode help text updated.
  - tests/test_chain_validation.py:
      test_lenient_accepts_same_level_pair → test_lenient_rejects_same_level_pair
      header docstring updated to reflect the new rule.
  - vault-manifest.json: chainCount 879 → 824, releaseHash rolls to
    479811040b7a… (real content delta, not a timestamp churn).
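
The per-mode Δ rule after this change can be sketched as follows. Only the "allowed_deltas" key and the Δ sets come from the commit messages; the function name deltas_ok and the list-of-levels input are illustrative assumptions, and the real MODE_CONFIG also carries the prompt template.

```python
# Δ sets as stated in the log: strict {1,2}; lenient {1,2,3} after this commit.
MODE_CONFIG = {
    "strict": {"allowed_deltas": {1, 2}},
    "lenient": {"allowed_deltas": {1, 2, 3}},
}


def deltas_ok(levels, mode):
    """True if every consecutive Bloom-level step is allowed in this mode.

    levels is the ordered list of member levels, e.g. [1, 2, 4].
    Same-level (Δ=0) and backward (Δ<0) steps fail in both modes.
    """
    allowed = MODE_CONFIG[mode]["allowed_deltas"]
    return all((b - a) in allowed for a, b in zip(levels, levels[1:]))
```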

Validation:
  - vault check --strict: 10,705 loaded, 0 failures
  - vault build --local-json: chainCount=824, releaseHash=479811040b…
  - pytest: 74/74
  - playwright chain-and-vault-smoke: 19/19 (fixtures cloud-0001 +
    cloud-0231 are still in their chains post-drop)

Audit findings #2 (gap detection is ~50% noise) and #3 (disposition of
the 4 pilot drafts) remain open; see CHAIN_ROADMAP.md Progress Log.
2026-05-02 08:51:49 -04:00
Vijay Janapa Reddi
66c10e6f2b feat(vault-cli): audit_chains_with_gemini.py — independent pipeline audit
Single-driver script that runs an independent Gemini audit over the
Phase 1-3 chain pipeline output. Designed as a complementary check to
the pipeline's own validation gates (Pydantic schema, embedding cosine,
multiple LLM judges) — runs an INDEPENDENT model pass over what would
otherwise be human-spot-check territory.

Categories (5 audit categories + 1 synthesis call, ~18 calls in total,
well under the 250/day Pro cap):

  1. drafts        4 Phase 3 promoted drafts: independent quality gate
                   (fabrication, level fit, answer correctness, scenario
                   realism — failure modes the existing judges miss)
  2. secondary     100-chain sample of tier=secondary chains
  3. delta_zero    All 55 Δ=0 chains (highest-risk lenient additions)
                   — verifies the "shared scenario" claim per-pair
  4. primary       100-chain sample of tier=primary chains (regression
                   check on strict-pass quality)
  5. gaps          50-gap sample with the two between-questions in full
                   (real bridge vs hallucination)
  6. synthesis     1 wrap-up call → AUDIT_REPORT.md

A previously-planned tier_compare category was dropped: 0 buckets
carry both primary and secondary chains (the lenient sweep was scoped
to uncovered buckets, by definition disjoint). Per-tier quality is
inferred from categories 2 and 4 by the synthesis call.

Per-call target: ~80K input tokens (320K char prompts) — the attention
sweet spot. Chain payloads at ~2-3K chars each pack ~50 chains into
one such prompt.

Outputs land in interviews/vault/audit-runs/<UTC-timestamp>/
  config.json            — what was sampled, with seed for reproducibility
  0N_<category>.json     — per-call prompt-char count, IDs, raw response
And one human-readable rollup at interviews/vault/AUDIT_REPORT.md.

Modes: --dry-run (plan only), --only <cat>, --skip <cat,...>,
--seed (for reproducible re-runs).

Findings only — never edits chains.json or any question YAML. Issues
surfaced for human review.
2026-05-01 17:38:00 -04:00
Vijay Janapa Reddi
c92effc269 feat(vault-cli): Phase 4.7 — chain decay detection (advisory)
Detects chain members that have drifted semantically away from their
chain mates after an edit. Re-embeds changed YAMLs with the same model
the corpus uses (BAAI/bge-small-en-v1.5) and reports the min cosine to
each chain mate.

Default invocation (advisory):

    python3 scripts/check_chain_decay.py
    # diffs against origin/dev, flags chains with min mate-cosine < 0.40

Other modes:

    --files <a.yaml> <b.yaml>     explicit files instead of git diff
    --base HEAD~5                 different base ref
    --threshold 0.50              tighter cutoff (slow drift detection)
    --strict                      exit non-zero on flag (use as CI gate)

Default is advisory, not blocking: the first release intentionally does
not fail commits or CI. The 0.40 threshold is calibrated against the
post-Phase-1 corpus; tune it once real-edit deltas have been observed
in practice.

Implementation notes:
  - Reuses embeddings.npz for chain-mate vectors (no re-embedding the
    whole corpus per run).
  - Only the changed question gets re-embedded — fast for typical
    PR-sized changes.
  - Skips changed questions that aren't in chains; skips chain
    memberships where the mate isn't in embeddings.npz (e.g., the
    Phase 3 promoted drafts before they hit the next embedding rebuild).
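
The decay check reduces to a minimum cosine against chain mates. The sketch below uses plain Python lists in place of the embeddings.npz vectors; the function names are hypothetical, and treating "no mates available" as "not flagged" is an assumption based on the skip rule above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_mate_cosine(changed_vec, mate_vecs):
    """Lowest similarity between the re-embedded question and its chain mates.

    Returns None when no mate has an embedding (such memberships are skipped).
    """
    sims = [cosine(changed_vec, mate) for mate in mate_vecs]
    return min(sims) if sims else None


def is_decayed(changed_vec, mate_vecs, threshold=0.40):
    """Flag the chain when the minimum mate cosine drops below the threshold."""
    lowest = min_mate_cosine(changed_vec, mate_vecs)
    return lowest is not None and lowest < threshold
```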

Smoke checks:
  - --base origin/dev finds 4 changed YAMLs (the Phase 3 promoted
    drafts), correctly reports no chain memberships (those questions
    aren't in chains.json yet — by design, gated on human review).
  - --files <cloud-2520.yaml> on a real chain member: cos=0.79 vs
    its L5 mate cloud-2521 (well above 0.40 threshold ✓).
2026-05-01 17:31:30 -04:00
Vijay Janapa Reddi
12b35a0929 feat(vault-cli): promote_drafts.py — one-command Phase 3.d helper
Closes the loop on the pilot pattern from a750ab7bc (manual promotion
inline script). Reads draft-validation-scorecard.json and either
promotes every passing draft (--all-passing) or an explicit list
(--qids edge-2536,edge-2537).

Per draft:
  - strips _authoring private metadata; replaces with proper schema
    fields (provenance, status, authors, human_reviewed, created_at)
  - adds gap-bridge:<lower>-<higher> tag for traceability
  - renames .yaml.draft → .yaml
  - appends id to id-registry.yaml (append-only — preserves the
    CI-enforced ledger contract)

Optional flags:
  --publish        flip status to published (default: keep as draft so
                   the human reviewer's workflow stays explicit)
  --reviewed-by X  set human_reviewed.status=verified, by=X, date=now
                   (implies the reviewer has actually read the drafts)
  --dry-run        preview without writing

Refuses to overwrite a <id>.yaml that already exists. Skips
already-promoted drafts (with a warning) when called with
--all-passing on a scorecard whose drafts have been promoted earlier.
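
A minimal sketch of the rename step with the refuse-to-overwrite guard. The function name promote_draft_file is hypothetical, and the metadata rewrite, tagging, and id-registry.yaml append are deliberately omitted.

```python
from pathlib import Path


def promote_draft_file(draft: Path, dry_run: bool = False) -> Path:
    """Rename <id>.yaml.draft to <id>.yaml, refusing to overwrite.

    Returns the target path. With dry_run=True nothing is written.
    """
    suffix = ".draft"
    if not draft.name.endswith(".yaml" + suffix):
        raise ValueError(f"not a draft file: {draft}")
    target = draft.with_name(draft.name[: -len(suffix)])
    if target.exists():
        # Mirrors the stated behaviour: never clobber an existing <id>.yaml.
        raise FileExistsError(f"refusing to overwrite {target}")
    if not dry_run:
        draft.rename(target)
    return target
```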

Smoke checks:
  - --all-passing on the existing scorecard correctly identifies all 4
    pilot drafts as already-promoted (they shipped in a750ab7bc).
  - --qids edge-2535 --dry-run on the leftover failed-validation draft
    previews the promotion as expected.
2026-05-01 17:22:45 -04:00
Vijay Janapa Reddi
202397f594 Merge origin/dev into yaml-audit
Pull in the dev work that landed since yaml-audit was last synced:
  - --legacy-json renamed to --local-json (2b381bb949) — script/doc
    updates needed below in this branch
  - CI workflow refactor (validate-dev / validate-vault now reusable)
  - all-contributors automation, gitignore tightening, codespell list
  - PR #1622 navbar URL rewrite for dev preview
  - PR #1619 clone-size refactor, #1618 milestone3 xor fix, #1617
    perceptron seed, #1616 tito status M3
  - Chapter 9 PDF layout refinement
  - assorted staffml/practice fixes (pickRandom deps, GitHub star gate)

This merges the canonical dev state into yaml-audit so subsequent
work continues on top of the freshest base. Conflicts in
practice/page.tsx + corpus.ts + ARCHITECTURE.md resolved to keep both
sides' additive changes (Phase 2 tier work + dev's later refactors).
2026-05-01 17:11:31 -04:00
Vijay Janapa Reddi
836d481b54 Merge branch 'yaml-audit' into dev (Phase 1 + 2 + 3 pilot + 4.8 docs)
Brings the chain corpus growth + tier-aware UI work into dev:

  - Phase 1: chains 373 → 879 (second-pass coverage build, primary +
    secondary tier; bucket coverage 33% → 91%)
  - Phase 2: tier surfacing through schema → TypeScript → UI (primary
    chains default; secondary reachable via ?chain= URL with "alt path"
    badge); 17/17 playwright
  - Phase 3 pilot: 5 gap-driven generations, 4 promoted as drafts
    (status=draft pending human review). edge-2535 left as .yaml.draft
    (failed originality gate).
  - Phase 4.8: ARCHITECTURE.md §3.6 + README "Chain build pipeline"
    section documenting v1.1 sidecar + hierarchy + tier model.

State at merge:
  - vault check --strict: 10,705 loaded (4 new drafts), 0 invariant failures
  - vault build --legacy-json: 9438 published, chainCount=879
    (drafts excluded by status filter — releaseHash unchanged from Phase 1)
  - playwright chain-and-vault-smoke: 17/17 (last yaml-audit run)

Phase 3.e (chain rebuild absorbing the new questions) gated on the
human review of the 4 drafts. Runbook in CHAIN_ROADMAP.md.
2026-05-01 13:39:33 -04:00
Vijay Janapa Reddi
604869b986 feat(vault-cli): Phase 3.a + 3.b — gap-driven authoring tooling
Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.

generate_question_for_gap.py (3.a):
  - Reads a gap entry, loads between-questions + same-bucket exemplars,
    prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
    and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
    drafts out of vault check / vault build until promotion.
  - ID allocator scans corpus + existing drafts so a batch run gets
    distinct fresh IDs without touching id-registry.yaml.
  - Modes: --gap-index, --gaps-from + --limit, --dry-run.
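
One way the ID allocator could hand out distinct fresh IDs is sketched below. The max-plus-one policy, the four-digit zero padding, and the name allocate_ids are assumptions inferred from IDs such as cloud-4579; the real allocator may differ.

```python
import re


def allocate_ids(track, existing_ids, count=1):
    """Return `count` unused IDs for a track.

    existing_ids should include both corpus IDs and existing draft IDs so a
    batch run never collides with an earlier draft.
    """
    pattern = re.compile(rf"^{re.escape(track)}-(\d+)$")
    numbers = [int(m.group(1)) for i in existing_ids if (m := pattern.match(i))]
    start = max(numbers, default=0) + 1
    return [f"{track}-{n:04d}" for n in range(start, start + count)]
```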

validate_drafts.py (3.b):
  - Five gates per draft: schema (Pydantic), originality (cosine vs
    in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus
    embeddings.npz so values are comparable; cutoff 0.92), level_fit
    (Gemini-judge against same-level exemplars), coherence
    (Gemini-judge: scenario/question/solution consistency), and bridge
    (Gemini-judge: chain-fit between the gap's two anchors).
  - Final verdict pass iff every non-skipped gate passes.
  - Skips: --no-originality, --no-llm-judge.
  - Output: interviews/vault/draft-validation-scorecard.json.
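
The verdict rule ("pass iff every non-skipped gate passes") can be sketched as below. Representing a skipped gate as None, the function names, and the strict "<" comparison at the 0.92 cutoff are assumptions for illustration.

```python
def originality_gate(top_neighbour_cosine, cutoff=0.92):
    """Pass when the closest in-bucket neighbour is below the cutoff."""
    return top_neighbour_cosine < cutoff


def final_verdict(gates):
    """gates maps gate name -> True (pass), False (fail), or None (skipped).

    Skipped gates are ignored; any failing gate that ran fails the draft.
    """
    ran = [result for result in gates.values() if result is not None]
    return "pass" if all(ran) else "fail"
```

For example, running with --no-llm-judge would leave level_fit, coherence, and bridge as None, so the verdict rests on schema and originality alone.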

Smoke checks:
  - 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
    cloud-4579. Synthetic Gemini response Pydantic-validates clean.
  - 3.b on a synthetic /tmp draft: schema + originality pass (top
    neighbour cosine 0.73 vs 0.92 threshold).

Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.

CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.
2026-05-01 11:31:06 -04:00
Vijay Janapa Reddi
4b880ebb1a feat(vault-cli): Phase 3.a + 3.b — gap-driven authoring tooling
Two new scripts that together close the loop from a gap entry to a
reviewable candidate question with a multi-gate scorecard.

generate_question_for_gap.py (3.a):
  - Reads a gap entry, loads between-questions + same-bucket exemplars,
    prompts gemini-3.1-pro-preview, runs Pydantic Question validation,
    and writes <track>/<area>/<id>.yaml.draft. The .draft suffix keeps
    drafts out of vault check / vault build until promotion.
  - ID allocator scans corpus + existing drafts so a batch run gets
    distinct fresh IDs without touching id-registry.yaml.
  - Modes: --gap-index, --gaps-from + --limit, --dry-run.

validate_drafts.py (3.b):
  - Five gates per draft: schema (Pydantic), originality (cosine vs
    in-bucket neighbours via BAAI/bge-small-en-v1.5; matches the corpus
    embeddings.npz so values are comparable; cutoff 0.92), level_fit
    (Gemini-judge against same-level exemplars), coherence
    (Gemini-judge: scenario/question/solution consistency), and bridge
    (Gemini-judge: chain-fit between the gap's two anchors).
  - Final verdict pass iff every non-skipped gate passes.
  - Skips: --no-originality, --no-llm-judge.
  - Output: interviews/vault/draft-validation-scorecard.json.

Smoke checks:
  - 3.a --dry-run --gap-index 0: resolves gap, builds prompt, allocates
    cloud-4579. Synthetic Gemini response Pydantic-validates clean.
  - 3.b on a synthetic /tmp draft: schema + originality pass (top
    neighbour cosine 0.73 vs 0.92 threshold).

Phase 3.c (pilot run on 30 gaps) deferred: it generates new YAML
question content that needs human review before promotion. The
tooling ships ready; running it is a user-supervised step.

CHAIN_ROADMAP.md Progress Log + Phase 3 status updated.
2026-05-01 11:31:06 -04:00
Vijay Janapa Reddi
83fe0f7193 feat(vault): Phase 1 — second-pass chain coverage build (373 → 879)
Diagnoses uncovered (track, topic) buckets and runs a relaxed Gemini
sweep targeting them. New chains tier="secondary"; pre-existing chains
backfilled tier="primary".

Tools (Phases 1.1, 1.2/1.3, 1.5):
  - diagnose_chain_coverage.py: surface buckets with no chains
    (committed earlier on yaml-audit)
  - build_chains_with_gemini.py: --mode lenient adds Δ ∈ {0,1,2,3}
    (committed earlier on yaml-audit)
  - merge_chain_passes.py: merges primary + secondary, enforces the
    multi-membership cap (max 2 chains/qid; non-L1/L2 capped at 1)
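
The multi-membership cap can be sketched as a check over merged chains. The data shapes (chains as lists of qids, levels as a qid-to-Bloom-level dict) and the function names are assumptions; the sketch only reports violations, whereas the real merge step enforces the cap.

```python
from collections import Counter


def membership_cap(level):
    """L1/L2 questions may anchor up to 2 chains; all others at most 1."""
    return 2 if level in (1, 2) else 1


def over_cap(chains, levels):
    """Return the sorted qids that appear in more chains than their cap allows."""
    counts = Counter(qid for chain in chains for qid in chain)
    return sorted(q for q, n in counts.items() if n > membership_cap(levels[q]))
```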

Sweep (Phase 1.4):
  - 17 Gemini-3.1-pro-preview calls, ~22 min wall time, 211 buckets
  - 506 chains accepted (above the 200-400 estimate), 269 new gaps
  - validator caught a few cross-bucket and Δ=4 hallucinations inline
  - Δ distribution: Δ=1 69.1%, Δ=2 21.1%, Δ=3 4.6%, Δ=0 5.2%
    (10.9% of chains contain at least one Δ=0 — within target band)
  - random spot-check of 5 Δ=0 chains: all share scenario threads
    (DMA, CMSIS-NN, on-device routing, PB-scale pipelines)

Coverage gains (chains/topic before → after):
  - cloud   2.95 → 4.37   (242 + 116 secondary)
  - edge    0.64 → 2.59   ( 49 + 148 secondary)
  - mobile  0.74 → 2.56   ( 46 + 113 secondary)
  - tinyml  0.80 → 2.64   ( 36 +  83 secondary)
  - global  0.00 → 0.96   (  0 +  46 secondary)
  Buckets with ≥1 chain: 102 / 313 (33%) → 285 / 313 (91%).

Validation:
  - apply_proposed_chains.py --dry-run: validation clean (879 chains)
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: chainCount 373 → 879, release_hash
    rolled to 04ee8a23…
  - playwright chain-and-vault-smoke.mjs: 13/13 pass

Phase 1 complete. Next: Phase 2 (tier surfacing in staffml UI).
2026-04-30 20:12:27 -04:00
Vijay Janapa Reddi
9e6f87bbd4 feat(vault): Phase 1 — second-pass chain coverage build (373 → 879)
Diagnoses uncovered (track, topic) buckets and runs a relaxed Gemini
sweep targeting them. New chains tier="secondary"; pre-existing chains
backfilled tier="primary".

Tools (Phases 1.1, 1.2/1.3, 1.5):
  - diagnose_chain_coverage.py: surface buckets with no chains
    (committed earlier on yaml-audit)
  - build_chains_with_gemini.py: --mode lenient adds Δ ∈ {0,1,2,3}
    (committed earlier on yaml-audit)
  - merge_chain_passes.py: merges primary + secondary, enforces the
    multi-membership cap (max 2 chains/qid; non-L1/L2 capped at 1)

Sweep (Phase 1.4):
  - 17 Gemini-3.1-pro-preview calls, ~22 min wall time, 211 buckets
  - 506 chains accepted (above the 200-400 estimate), 269 new gaps
  - validator caught a few cross-bucket and Δ=4 hallucinations inline
  - Δ distribution: Δ=1 69.1%, Δ=2 21.1%, Δ=3 4.6%, Δ=0 5.2%
    (10.9% of chains contain at least one Δ=0 — within target band)
  - random spot-check of 5 Δ=0 chains: all share scenario threads
    (DMA, CMSIS-NN, on-device routing, PB-scale pipelines)

Coverage gains (chains/topic before → after):
  - cloud   2.95 → 4.37   (242 + 116 secondary)
  - edge    0.64 → 2.59   ( 49 + 148 secondary)
  - mobile  0.74 → 2.56   ( 46 + 113 secondary)
  - tinyml  0.80 → 2.64   ( 36 +  83 secondary)
  - global  0.00 → 0.96   (  0 +  46 secondary)
  Buckets with ≥1 chain: 102 / 313 (33%) → 285 / 313 (91%).

Validation:
  - apply_proposed_chains.py --dry-run: validation clean (879 chains)
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: chainCount 373 → 879, release_hash
    rolled to 04ee8a23…
  - playwright chain-and-vault-smoke.mjs: 13/13 pass

Phase 1 complete. Next: Phase 2 (tier surfacing in staffml UI).
2026-04-30 20:12:27 -04:00
Vijay Janapa Reddi
d272d374aa feat(chains): --mode lenient + tier field for second-pass coverage
Phase 1.2 + 1.3 of CHAIN_ROADMAP.md. The two land together because the
prompt template, validator Δ-rule, and tier-tagging must stay in lockstep
or chains.proposed.lenient.json would mis-validate.

build_chains_with_gemini.py:
  - new LENIENT_PROMPT_TEMPLATE alongside renamed STRICT_PROMPT_TEMPLATE;
    lenient template tells Gemini to accept Δ ∈ {0,1,2,3}, with Δ=0 only
    for shared-scenario same-level pairs and Δ=3 last-resort
  - MODE_CONFIG single-source-of-truth maps mode → (template, allowed Δ set)
  - validate_chain now takes mode= and gates on the per-mode Δ set
  - process_batch tags lenient-mode chains with tier="secondary" and
    a chain_id suffix (-secondary) so primary/secondary IDs never collide
  - new --mode {strict,lenient} flag (default strict — primary chains
    keep producing under the same rules as before)
  - new --buckets-from <chain-coverage.json> flag that restricts the run
    to the uncovered_buckets list from diagnose_chain_coverage.py
    (the Phase 1.4 second-pass entry point)

apply_proposed_chains.py:
  - docstring note: tier field is intentionally not validated here
    (it's a UI hint, not a structural invariant)
  - already accepts Δ=0 chains via its non-strict monotonicity check, so
    no logic change needed

tests/test_chain_validation.py:
  - 19 cases covering both modes: strict accepts +1/+2 and rejects Δ=0,
    Δ≥3, and backward; lenient accepts Δ=0/Δ=3 but still rejects Δ≥4 and
    backward; both modes reject size-out-of-range, multi-topic, and
    unknown qids. Loads the script via importlib (it's not part of the
    importable vault_cli package).

Smoke check (--dry-run --buckets-from chain-coverage.json --mode lenient):
17 calls planned for the 211 uncovered buckets, well under the 200 cap.
2026-04-30 19:29:12 -04:00
Vijay Janapa Reddi
4b785ce26d feat(chains): --mode lenient + tier field for second-pass coverage
Phase 1.2 + 1.3 of CHAIN_ROADMAP.md. The two land together because the
prompt template, validator Δ-rule, and tier-tagging must stay in lockstep
or chains.proposed.lenient.json would mis-validate.

build_chains_with_gemini.py:
  - new LENIENT_PROMPT_TEMPLATE alongside renamed STRICT_PROMPT_TEMPLATE;
    lenient template tells Gemini to accept Δ ∈ {0,1,2,3}, with Δ=0 only
    for shared-scenario same-level pairs and Δ=3 last-resort
  - MODE_CONFIG single-source-of-truth maps mode → (template, allowed Δ set)
  - validate_chain now takes mode= and gates on the per-mode Δ set
  - process_batch tags lenient-mode chains with tier="secondary" and
    a chain_id suffix (-secondary) so primary/secondary IDs never collide
  - new --mode {strict,lenient} flag (default strict — primary chains
    keep producing under the same rules as before)
  - new --buckets-from <chain-coverage.json> flag that restricts the run
    to the uncovered_buckets list from diagnose_chain_coverage.py
    (the Phase 1.4 second-pass entry point)

apply_proposed_chains.py:
  - docstring note: tier field is intentionally not validated here
    (it's a UI hint, not a structural invariant)
  - already accepts Δ=0 chains via its non-strict monotonicity check, so
    no logic change needed

tests/test_chain_validation.py:
  - 19 cases covering both modes: strict accepts +1/+2 and rejects Δ=0,
    Δ≥3, and backward; lenient accepts Δ=0/Δ=3 but still rejects Δ≥4 and
    backward; both modes reject size-out-of-range, multi-topic, and
    unknown qids. Loads the script via importlib (it's not part of the
    importable vault_cli package).

Smoke check (--dry-run --buckets-from chain-coverage.json --mode lenient):
17 calls planned for the 211 uncovered buckets, well under the 200 cap.
2026-04-30 19:29:12 -04:00
Vijay Janapa Reddi
b289a5eb75 Merge branch 'yaml-audit' into dev
Brings the vault chain rebuild + sidecar architecture work into dev:

  - Hierarchical question layout (interviews/vault/questions/<track>/<area>/<id>.yaml)
    completed in earlier dev merge; this branch adds the sidecar split
  - chains.json is now the authoritative chain registry; YAML chains: field
    stripped from all 10,701 question files
  - 373 chains rebuilt via Gemini 3.1 Pro Preview with strict progression
    rules (Δ ∈ {1,2}, single-track, single-topic, multi-membership cap=2)
  - 138 gaps surfaced into gaps.proposed.json for Phase 3 authoring
  - Tooling: build_chains_with_gemini.py, apply_proposed_chains.py,
    summarize_proposed_chains.py, diagnose_chain_coverage.py
  - CHAIN_ROADMAP.md captures the resumable Phase 1-4 plan

State at merge:
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: clean, releaseId=dev, 9438 published, 373 chains
  - playwright UI suite (last run on yaml-audit): 13/13 pass

Phase 1.1 (diagnose_chain_coverage.py) shipped on yaml-audit; Phase
1.2-1.6 (lenient sweep, tier merge) still pending. See CHAIN_ROADMAP.md
Progress Log for the resumable cursor.
2026-04-30 18:39:05 -04:00
Vijay Janapa Reddi
f527c230f3 Merge branch 'yaml-audit' into dev
Brings the vault chain rebuild + sidecar architecture work into dev:

  - Hierarchical question layout (interviews/vault/questions/<track>/<area>/<id>.yaml)
    completed in earlier dev merge; this branch adds the sidecar split
  - chains.json is now the authoritative chain registry; YAML chains: field
    stripped from all 10,701 question files
  - 373 chains rebuilt via Gemini 3.1 Pro Preview with strict progression
    rules (Δ ∈ {1,2}, single-track, single-topic, multi-membership cap=2)
  - 138 gaps surfaced into gaps.proposed.json for Phase 3 authoring
  - Tooling: build_chains_with_gemini.py, apply_proposed_chains.py,
    summarize_proposed_chains.py, diagnose_chain_coverage.py
  - CHAIN_ROADMAP.md captures the resumable Phase 1-4 plan

State at merge:
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: clean, releaseId=dev, 9438 published, 373 chains
  - playwright UI suite (last run on yaml-audit): 13/13 pass

Phase 1.1 (diagnose_chain_coverage.py) shipped on yaml-audit; Phase
1.2-1.6 (lenient sweep, tier merge) still pending. See CHAIN_ROADMAP.md
Progress Log for the resumable cursor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:39:05 -04:00
Vijay Janapa Reddi
af5f25f543 feat(vault-cli): diagnose_chain_coverage.py — surface buckets needing chains
Loads the published corpus (via vault_cli.policy — single source of truth)
and chains.json, buckets by (track, topic), and emits chain-coverage.json
with two cuts:
  - uncovered_buckets: ≥3 questions, 0 chains
  - under_covered_buckets: ≥6 questions, ≤1 chain
Plus per-track summary + top-10 uncovered for quick read.
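
The two cuts can be sketched with simple counters. The input shapes (one (track, topic) tuple per question and per chain) and the name classify_buckets are assumptions; the thresholds are applied literally, so a bucket with 6 or more questions and 0 chains lands in both lists here, which may not match the real script.

```python
from collections import Counter


def classify_buckets(question_buckets, chain_buckets):
    """Return (uncovered, under_covered) lists of (track, topic) buckets.

    question_buckets: one (track, topic) entry per published question.
    chain_buckets:    one (track, topic) entry per chain.
    """
    questions = Counter(question_buckets)
    chains = Counter(chain_buckets)
    uncovered = sorted(b for b, n in questions.items() if n >= 3 and chains[b] == 0)
    under_covered = sorted(b for b, n in questions.items() if n >= 6 and chains[b] <= 1)
    return uncovered, under_covered
```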

Output is gitignored — regeneratable, fed to Phase 1.4's --buckets-from.

Phase 1.1 of CHAIN_ROADMAP.md. See progress log for the run results
(211 uncovered buckets, edge/mobile/tinyml chain density 0.6-0.8 vs
cloud's 2.95, biggest miss is cloud:roofline-analysis at 144q/0 chains).
2026-04-30 18:15:59 -04:00
Vijay Janapa Reddi
3526176384 feat(vault-cli): diagnose_chain_coverage.py — surface buckets needing chains
Loads the published corpus (via vault_cli.policy — single source of truth)
and chains.json, buckets by (track, topic), and emits chain-coverage.json
with two cuts:
  - uncovered_buckets: ≥3 questions, 0 chains
  - under_covered_buckets: ≥6 questions, ≤1 chain
Plus per-track summary + top-10 uncovered for quick read.

Output is gitignored — regeneratable, fed to Phase 1.4's --buckets-from.

Phase 1.1 of CHAIN_ROADMAP.md. See progress log for the run results
(211 uncovered buckets, edge/mobile/tinyml chain density 0.6-0.8 vs
cloud's 2.95, biggest miss is cloud:roofline-analysis at 144q/0 chains).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:15:59 -04:00
Vijay Janapa Reddi
9fdbfb9a4c refactor(vault-cli): rename --legacy-json to --local-json
The flag is the StaffML frontend's local-dev fallback (read corpus.json
from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path.
"Legacy" implied "soon to be removed"; "local-json" describes its actual
role and reads correctly in scripts and docs.

- vault-cli: rename CLI flag, parameter, result key, and help text.
- CI workflows + pre-commit config: invoke the new flag name.
- All scripts that print the command (suggest_exemplars,
  pre_commit_corpus_guard, promote_validated, rename_legacy_ids,
  export_to_staffml, the paper analyze_corpus/generate_*) updated.
- Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING,
  MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend
  comments and .env.example / .gitignore) updated.

The "legacy_json" sentinel string in corpus_stats.json._meta.source
is intentionally NOT renamed — it is a stable artifact format read
by downstream paper-generation tooling.
2026-04-30 09:30:28 -04:00
Vijay Janapa Reddi
2b381bb949 refactor(vault-cli): rename --legacy-json to --local-json
The flag is the StaffML frontend's local-dev fallback (read corpus.json
from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path.
"Legacy" implied "soon to be removed"; "local-json" describes its actual
role and reads correctly in scripts and docs.

- vault-cli: rename CLI flag, parameter, result key, and help text.
- CI workflows + pre-commit config: invoke the new flag name.
- All scripts that print the command (suggest_exemplars,
  pre_commit_corpus_guard, promote_validated, rename_legacy_ids,
  export_to_staffml, the paper analyze_corpus/generate_*) updated.
- Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING,
  MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend
  comments and .env.example / .gitignore) updated.

The "legacy_json" sentinel string in corpus_stats.json._meta.source
is intentionally NOT renamed — it is a stable artifact format read
by downstream paper-generation tooling.
2026-04-30 09:30:28 -04:00
Vijay Janapa Reddi
d82a4f00aa fix(chains): tolerate Gemini CLI exit-1 + add inter-call backoff
2026-04-30 09:22:57 -04:00
Vijay Janapa Reddi
681e404633 feat(chains): add gap detection + multi-chain UI helpers
build_chains_with_gemini.py: prompt now asks Gemini to also surface
missing-rung gaps — e.g., 'this bucket has L1 + L3 questions on the same
scenario thread but no L2 to bridge them.' Gaps are captured to
interviews/vault/gaps.proposed.json as a separate authoring backlog.
This is a free signal: it costs no extra calls, identifies pedagogical
holes the corpus doesn't yet fill, and feeds a future generation pass
(with independent validation before any new question is committed).

corpus.ts: getChainForQuestion now accepts an optional preferredChainId
so multi-chain questions can disambiguate via URL (?chain=...). Adds
getAllChainsForQuestion() returning every chain a qid belongs to.
Default behavior unchanged when only one chain exists.
2026-04-30 09:02:35 -04:00
Vijay Janapa Reddi
0b14e08b52 feat(vault-cli): summarize_proposed_chains.py — quick-read report on staging
After build_chains_with_gemini.py produces chains.proposed.json, this
script gives a one-shot summary: chain count, size distribution, level-Δ
histogram, multi-chain membership stats, sample chains for spot-check.
2026-04-30 09:00:04 -04:00
Vijay Janapa Reddi
d8a55f3334 feat(chains): tighten progression rules + allow up to 2-chain membership
Gemini prompt + structural validator now enforce:
  - Consecutive Bloom delta MUST be 1 or 2 (rejects Δ=0 same-level pairs
    and Δ≥3 huge jumps; backward steps already impossible)
  - Strict +1 preferred; +2 accepted only when no +1 candidate exists
  - A question can appear in up to 2 chains, but only if it's L1 or L2
    (foundational anchor pattern); 3+ chain memberships are rejected as
    over-stuffing

Empirical alignment: 70% of legacy chains were strict +1, 19% had +2
jumps, 8% had +3 jumps that we now reject as too-large pedagogical
moves. The new rules tighten quality while keeping the bulk of
defensible existing structure expressible.
2026-04-30 08:58:38 -04:00
Vijay Janapa Reddi
8423dcb08f feat(vault-cli): Gemini-powered chain builder + apply script
build_chains_with_gemini.py — adaptive batched chain proposal:
  - Buckets corpus by (track, topic), packs into ~80K-token batches
  - Calls gemini-3.1-pro-preview with structured-output prompt
  - Validates each proposed chain (size 2-6, monotonic, single-topic,
    members exist, no cross-chain duplicates)
  - Writes staging chains.proposed.json (never touches live registry)
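
Packing buckets into token-budgeted batches might look like the sketch below. Greedy packing in input order is an assumption (the commit only says "adaptive batched"), as are the function name and the precomputed per-bucket token estimates.

```python
def pack_batches(bucket_tokens, budget=80_000):
    """Greedily group buckets so each batch stays within the token budget.

    bucket_tokens: iterable of (bucket_name, estimated_tokens) pairs.
    A single bucket larger than the budget still gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for name, tokens in bucket_tokens:
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        batches.append(current)
    return batches
```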

Full-corpus plan: 313 buckets pack into 44 calls (well under 250/day Pro
cap, uses ~70K input tokens per call out of 1M context).

Test on tinyml:network-bandwidth-bottlenecks (6 questions) -> 2 well-formed
chains, Bloom-monotonic with coherent rationale (Hailo-8 PCIe arc + BLE
network arc).

apply_proposed_chains.py — gated migration:
  - Re-validates staging file against live YAML corpus
  - Backs up chains.json -> chains.json.bak
  - Refuses to apply if any structural invariant fails
2026-04-30 08:53:06 -04:00
Vijay Janapa Reddi
e43ff34719 feat(vault-cli): chain audit + rescue suggestions with embedding similarity
Adds two subcommands and supporting modules:

  vault chains audit
    Reports chain health: orphans, position-drift (gaps from filtered
    members), stale-registry, intra-chain cosine distribution, weakest
    chains list. Embedding-aware by default; --no-embeddings is the
    escape hatch.

  vault chains suggest
    For each orphan singleton, ranks rescue candidates within the same
    (track, topic) bucket. Hybrid scoring:
      HARD filter: level_delta in {0, 1, 2} (matches 92% of observed
                   chain edges across the corpus)
      SOFT rank:   embedding cosine + delta=1 priority
      Bands:       strong-merge / review-merge / below-threshold
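
The hybrid scoring can be sketched as a filter followed by a sort. Treating level_delta as an absolute difference, ranking delta=1 candidates ahead of the rest, and the name rank_candidates are all assumptions; the band thresholds are not given in the commit and are left out.

```python
def rank_candidates(orphan_level, candidates):
    """Rank rescue candidates for an orphan question.

    candidates: list of (qid, level, cosine) tuples from the same bucket.
    HARD filter: |level - orphan_level| must be 0, 1, or 2.
    SOFT rank:   delta == 1 first, then higher cosine first.
    """
    kept = [c for c in candidates if abs(c[1] - orphan_level) in {0, 1, 2}]
    return sorted(kept, key=lambda c: (abs(c[1] - orphan_level) != 1, -c[2]))
```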

Embeddings: bge-small-en-v1.5 (BAAI). Calibrated via
scripts/calibrate_chain_embeddings.py against the 726 healthy chains.
Empirical findings (in script header docstring):
  - bge-small precision@1 = 0.283, recall@3 = 0.447
  - bge-large gains only +0.013 P@1 at 7x embedding time — not worth it
  - Same-bucket questions are inherently close (μ_pos=0.785, μ_neg=0.757);
    so this is suggestion-only, never auto-apply.

Cross-encoder rerank experiment script included for future research
(BAAI/bge-reranker-base). The current run ran out of memory on 16GB, so
the experiment is deferred.

Embedding cache (.npz) is gitignored — reproducible from source.
2026-04-29 19:00:09 -04:00
Vijay Janapa Reddi
43dedf9948 docs(vault): update architecture docs and audit scripts for 87-topic baseline
Update ARCHITECTURE.md to reflect 87 curated topics and 131 edges. Refactor exemplar_coverage_audit.py to use vault.db instead of retired corpus.json. Update exemplar-gaps.yaml inventory.
2026-04-26 16:47:56 -04:00