2 Commits

Vijay Janapa Reddi
ece6eccf23 feat(vault): massive build — 630 drafts generated, 320 PASS promoted, paper 0.1.1
Phase 1 (analyzer):  top-priority cells: tinyml/parallelism (0/90),
                     tinyml/networking (2/90), mobile/parallelism (0/127),
                     edge/parallelism (12/152), global/L4-L6+ deeply empty.
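
A minimal sketch of the gap ranking this phase performs, reading each
cell's counts as filled/target (the function shape is an assumption,
not the analyzer's actual code):

    # Hypothetical re-implementation of the analyzer's ranking step.
    def rank_cells(cells: dict[tuple[str, str], tuple[int, int]]):
        def empty_fraction(item):
            _, (filled, target) = item
            return (target - filled) / target
        # Largest remaining gap first.
        return sorted(cells.items(), key=empty_fraction, reverse=True)

    cells = {("tinyml", "parallelism"): (0, 90),
             ("tinyml", "networking"): (2, 90),
             ("mobile", "parallelism"): (0, 127),
             ("edge", "parallelism"): (12, 152)}
    for (track, zone), (filled, target) in rank_cells(cells):
        print(f"{track}/{zone}: {filled}/{target}")
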
Phase 2 (loop):      6 iterations, 50 of 80 API calls used, 630 drafts
                     generated (52% PASS / 19% NEEDS_FIX / 26% DROP /
                     ~3% unjudged). Saturation reason: the same cell
                     held top priority two iterations in a row, i.e.
                     structural convergence. The top-priority gap
                     decayed 2.25 → 2.14 → 2.03 → 1.93 → 1.83 and
                     then plateaued; the generator cannot meaningfully
                     shrink
                     tinyml/specification/L6+ further within current
                     prompt framing. Both threshold-based halt
                     conditions (gap-threshold 0.8, max-calls 80)
                     still had headroom; structural convergence fired
                     first. Loop defaults bumped:
                     max-iters 20 → 30, max-calls 60 → 80, batch 12 → 30,
                     calls/iter 3 → 4, judge chunk 15 → 25.
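
The bumped defaults, sketched as a config object (field names are
assumptions; the values are the ones listed above):

    from dataclasses import dataclass

    # Hypothetical mirror of the new iterate_coverage_loop.py defaults.
    @dataclass
    class LoopDefaults:
        max_iters: int = 30         # was 20
        max_calls: int = 80         # was 60
        batch_size: int = 30        # was 12
        calls_per_iter: int = 4     # was 3
        judge_chunk: int = 25       # was 15
        gap_threshold: float = 0.8  # halt when the top gap drops below this
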
Phase 3 (quality):   Spot-read 4 PASS items + visuals across cloud/edge/
                     mobile/tinyml. All technically sound, math correct,
                     real hardware grounding (MI300X, Jetson Orin,
                     Cortex-M4 BLE), SVGs follow svg-style.md palette.
                     Systemic finding: generator emitted 462 drafts with
                     malformed competency_area values (60 distinct
                     patterns: zones-as-area, bloom-verbs-as-area,
                     underscore hallucinations, dash-form/slash-form
                     concatenations). Resolved by extending the
                     fix_competency_areas.py REMAP table; re-running
                     the cleanup mapped all 462 to canonical values.
                     Root cause (the generator skips Pydantic
                     validation at write time) flagged for a
                     follow-on fix; not blocking.
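
The cleanup itself is a straight table lookup; a sketch of its shape
(the entries below are invented examples of the malformed patterns,
not the real mapping):

    # Hypothetical excerpt of the extended REMAP table in
    # fix_competency_areas.py; real entries and canonical names differ.
    REMAP = {
        "ml_systems_design": "ML Systems Design",   # underscore hallucination
        "optimization-zone": "Model Optimization",  # zone-as-area, dash form
        "analyze/evaluate": "Benchmarking",         # Bloom verbs as area
    }

    def canonicalize(area: str) -> str:
        # Unknown values pass through unchanged for later triage.
        return REMAP.get(area, area)
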
Phase 4 (promote):   320 PASS items promoted; bundle 9,224 → 9,544
                     published (exactly +320). Visual assets: 234 in
                     bundle, mirrored to staffml/public/.
Phase 5 (paper):     Cut 0.1.1 release (patch bump: content addition,
                     no schema change). release_hash 0350da5706e6.
                     macros.tex regenerated to 9,544 questions /
                     87 topics / 13 areas / 11 zones; 4 figures
                     rebuilt; paper.tex
                     zone counts updated (1,583/1,227/1,113 →
                     1,615/1,256/1,144). PDF compiles to 25 pages,
                     no LaTeX errors (citation warnings pre-existing).
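
The macros.tex regeneration amounts to templating corpus counts into
LaTeX macros; a sketch (the macro names are assumptions, the counts
are the 0.1.1 values):

    # Hypothetical macros.tex writer; the actual macro names differ.
    counts = {"QuestionCount": "9,544", "TopicCount": "87",
              "AreaCount": "13", "ZoneCount": "11"}
    with open("macros.tex", "w") as f:
        for name, value in counts.items():
            f.write(f"\\newcommand{{\\vault{name}}}{{{value}}}\n")
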
Phase 6 (GUI):       All 8 Playwright tests pass on a fresh dev server.
                     /practice HTML contains zero malformed area names
                     (down from 60 distinct pre-fix).
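
The malformed-area check reduces to asserting that known-bad strings
never appear in the rendered page; a sketch using Playwright's Python
API (the URL and bad-string list are assumptions, and the real suite
may well be TypeScript):

    from playwright.sync_api import sync_playwright

    # Invented examples of pre-fix malformed area strings.
    BAD_AREAS = ["ml_systems_design", "optimization-zone"]

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/practice")
        html = page.content()
        assert not any(bad in html for bad in BAD_AREAS), \
            "malformed competency_area leaked into /practice"
        browser.close()
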
Phase 7 (manifest):  vault-manifest.json refreshed: questionCount
                     9224 → 9544, contentHash 539eb877f9cc → 0350da5706e6,
                     track + level distributions updated to match
                     0.1.1 corpus.
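
A sketch of the refresh (the key names follow the fields named above;
the script shape is an assumption):

    import json

    with open("vault-manifest.json") as f:
        manifest = json.load(f)

    manifest["questionCount"] = 9544
    manifest["contentHash"] = "0350da5706e6"
    # Track and level distributions would be recomputed from the
    # 0.1.1 corpus here before writing back.

    with open("vault-manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
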

Loop run dir: interviews/vault/_validation_results/coverage_loop/20260425_150712
Deferred queue (next session): 120 NEEDS_FIX items carrying judge
fix_suggestions + 165 DROP items, plus the generator validate-at-write fix.

The runbook (vault/docs/MASSIVE_BUILD_RUNBOOK.md) captures the
methodology this session followed; it can be re-run on any future
generation day.
2026-04-25 13:15:41 -04:00
Vijay Janapa Reddi
0afc384282 feat(vault): LLM-as-judge validator + iterative coverage loop
Two new pieces close the generation→validation→saturation feedback loop:

1. gemini_cli_llm_judge.py — multi-criteria validator. For each draft,
   judges math correctness, cell-fit (does it actually target the
   declared track/zone/level?), scenario realism, uniqueness vs canonical
   questions, and visual-asset alignment. Returns PASS/NEEDS_FIX/DROP
   per item. Batched (default 15 per call) for budget efficiency.

2. iterate_coverage_loop.py — drives the full loop:
   analyze → plan → generate → render → judge → apply → re-analyze.
   Self-paced: stops when (a) top priority gap drops below threshold,
   (b) DROP rate exceeds the saturation/hallucination threshold,
   (c) total API calls exceed budget, or (d) the same cell is top
   priority for two iterations in a row (convergence). The user no
   longer specifies "how many questions"; the loop generates until
   the corpus reaches a measurable steady state (halt logic
   sketched below).
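
A minimal sketch of the four halt checks (parameter names and the
DROP-rate threshold are assumptions; the gap threshold and call
budget are the values these messages mention):

    # Illustrative only; the real logic lives in iterate_coverage_loop.py.
    def should_halt(top_gap, drop_rate, calls_used, top_cell,
                    prev_top_cell, gap_threshold=0.8,
                    drop_threshold=0.5, max_calls=80):
        if top_gap < gap_threshold:
            return "gap below threshold"        # (a) corpus balanced enough
        if drop_rate > drop_threshold:
            return "saturation/hallucination"   # (b) judge DROPs too much
        if calls_used >= max_calls:
            return "budget exhausted"           # (c) API call cap
        if top_cell == prev_top_cell:
            return "structural convergence"     # (d) same cell on top twice
        return None  # keep looping
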

Plus 25 round-1 visual questions generated by the new batched generator
(5 batched calls × 5 cells each, zero failures).

The loop is the answer to "we need balance, not just volume": every
iteration's plan derives from a fresh analysis of where coverage is
weakest, so generation can never over-fill an already-saturated cell.
2026-04-25 09:18:32 -04:00