2 Commits

Vijay Janapa Reddi
ece6eccf23 feat(vault): massive build — 630 drafts generated, 320 PASS promoted, paper 0.1.1
Phase 1 (analyzer):  top-priority cells: tinyml/parallelism (0/90),
                     tinyml/networking (2/90), mobile/parallelism (0/127),
                     edge/parallelism (12/152), global/L4-L6+ deeply empty.
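
A minimal sketch of the gap ranking this phase performs, reading each
cell's counts as filled/target (the function shape is an assumption,
not the analyzer's actual code):

    # Hypothetical re-implementation of the analyzer's ranking step.
    def rank_cells(cells: dict[tuple[str, str], tuple[int, int]]):
        def empty_fraction(item):
            _, (filled, target) = item
            return (target - filled) / target
        # Largest remaining gap first.
        return sorted(cells.items(), key=empty_fraction, reverse=True)

    cells = {("tinyml", "parallelism"): (0, 90),
             ("tinyml", "networking"): (2, 90),
             ("mobile", "parallelism"): (0, 127),
             ("edge", "parallelism"): (12, 152)}
    for (track, zone), (filled, target) in rank_cells(cells):
        print(f"{track}/{zone}: {filled}/{target}")
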
Phase 2 (loop):      6 iterations, 50 of 80 API calls used, 630 drafts
                     generated (52% PASS / 19% NEEDS_FIX / 26% DROP /
                     ~3% unjudged). Saturation reason: the same cell
                     held top priority two iterations in a row, i.e.
                     structural convergence. The top-priority gap
                     decayed 2.25 → 2.14 → 2.03 → 1.93 → 1.83 and
                     then plateaued; the generator cannot meaningfully
                     shrink
                     tinyml/specification/L6+ further within current
                     prompt framing. Both threshold-based halt
                     conditions (gap-threshold 0.8, max-calls 80)
                     still had headroom; structural convergence fired
                     first. Loop defaults bumped:
                     max-iters 20 → 30, max-calls 60 → 80, batch 12 → 30,
                     calls/iter 3 → 4, judge chunk 15 → 25.
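
The bumped defaults, sketched as a config object (field names are
assumptions; the values are the ones listed above):

    from dataclasses import dataclass

    # Hypothetical mirror of the new iterate_coverage_loop.py defaults.
    @dataclass
    class LoopDefaults:
        max_iters: int = 30         # was 20
        max_calls: int = 80         # was 60
        batch_size: int = 30        # was 12
        calls_per_iter: int = 4     # was 3
        judge_chunk: int = 25       # was 15
        gap_threshold: float = 0.8  # halt when the top gap drops below this
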
Phase 3 (quality):   Spot-read 4 PASS items + visuals across cloud/edge/
                     mobile/tinyml. All technically sound, math correct,
                     real hardware grounding (MI300X, Jetson Orin,
                     Cortex-M4 BLE), SVGs follow svg-style.md palette.
                     Systemic finding: generator emitted 462 drafts with
                     malformed competency_area values (60 distinct
                     patterns: zones-as-area, bloom-verbs-as-area,
                     underscore hallucinations, dash-form/slash-form
                     concatenations). Resolved by extending the
                     fix_competency_areas.py REMAP table; re-running
                     the cleanup mapped all 462 to canonical values.
                     Root cause (the generator skips Pydantic
                     validation at write time) flagged for a
                     follow-on fix; not blocking.
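
The cleanup itself is a straight table lookup; a sketch of its shape
(the entries below are invented examples of the malformed patterns,
not the real mapping):

    # Hypothetical excerpt of the extended REMAP table in
    # fix_competency_areas.py; real entries and canonical names differ.
    REMAP = {
        "ml_systems_design": "ML Systems Design",   # underscore hallucination
        "optimization-zone": "Model Optimization",  # zone-as-area, dash form
        "analyze/evaluate": "Benchmarking",         # Bloom verbs as area
    }

    def canonicalize(area: str) -> str:
        # Unknown values pass through unchanged for later triage.
        return REMAP.get(area, area)
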
Phase 4 (promote):   320 PASS items promoted; bundle 9,224 → 9,544
                     published (exactly +320). Visual assets: 234 in
                     bundle, mirrored to staffml/public/.
Phase 5 (paper):     Cut 0.1.1 release (patch bump: content addition,
                     no schema change). release_hash 0350da5706e6.
                     macros.tex regenerated to 9,544 questions /
                     87 topics / 13 areas / 11 zones; 4 figures
                     rebuilt; paper.tex
                     zone counts updated (1,583/1,227/1,113 →
                     1,615/1,256/1,144). PDF compiles to 25 pages,
                     no LaTeX errors (citation warnings pre-existing).
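
The macros.tex regeneration amounts to templating corpus counts into
LaTeX macros; a sketch (the macro names are assumptions, the counts
are the 0.1.1 values):

    # Hypothetical macros.tex writer; the actual macro names differ.
    counts = {"QuestionCount": "9,544", "TopicCount": "87",
              "AreaCount": "13", "ZoneCount": "11"}
    with open("macros.tex", "w") as f:
        for name, value in counts.items():
            f.write(f"\\newcommand{{\\vault{name}}}{{{value}}}\n")
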
Phase 6 (GUI):       All 8 Playwright tests pass on a fresh dev server.
                     /practice HTML contains zero malformed area names
                     (down from 60 distinct pre-fix).
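
The malformed-area check reduces to asserting that known-bad strings
never appear in the rendered page; a sketch using Playwright's Python
API (the URL and bad-string list are assumptions, and the real suite
may well be TypeScript):

    from playwright.sync_api import sync_playwright

    # Invented examples of pre-fix malformed area strings.
    BAD_AREAS = ["ml_systems_design", "optimization-zone"]

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/practice")
        html = page.content()
        assert not any(bad in html for bad in BAD_AREAS), \
            "malformed competency_area leaked into /practice"
        browser.close()
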
Phase 7 (manifest):  vault-manifest.json refreshed: questionCount
                     9224 → 9544, contentHash 539eb877f9cc → 0350da5706e6,
                     track + level distributions updated to match
                     0.1.1 corpus.
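
A sketch of the refresh (the key names follow the fields named above;
the script shape is an assumption):

    import json

    with open("vault-manifest.json") as f:
        manifest = json.load(f)

    manifest["questionCount"] = 9544
    manifest["contentHash"] = "0350da5706e6"
    # Track and level distributions would be recomputed from the
    # 0.1.1 corpus here before writing back.

    with open("vault-manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
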

Loop run dir: interviews/vault/_validation_results/coverage_loop/20260425_150712
Deferred queue (next session): 120 NEEDS_FIX items carrying judge
fix_suggestions + 165 DROP items, plus the generator validate-at-write fix.

The runbook (vault/docs/MASSIVE_BUILD_RUNBOOK.md) captures the
methodology this session followed; it can be re-run on any future
generation day.
2026-04-25 13:15:41 -04:00
Vijay Janapa Reddi
0afc384282 feat(vault): LLM-as-judge validator + iterative coverage loop
Two new pieces close the generation→validation→saturation feedback loop:

1. gemini_cli_llm_judge.py — multi-criteria validator. For each draft,
   judges math correctness, cell-fit (does it actually target the
   declared track/zone/level?), scenario realism, uniqueness vs canonical
   questions, and visual-asset alignment. Returns PASS/NEEDS_FIX/DROP
   per item. Batched (default 15 per call) for budget efficiency.

2. iterate_coverage_loop.py — drives the full loop:
   analyze → plan → generate → render → judge → apply → re-analyze.
   Self-paced: stops when (a) top priority gap drops below threshold,
   (b) DROP rate exceeds the saturation/hallucination threshold,
   (c) total API calls exceed budget, or (d) the same cell is top
   priority for two iterations in a row (convergence). The user no
   longer specifies "how many questions"; the loop generates until
   the corpus reaches a measurable steady state (halt logic
   sketched below).
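
A minimal sketch of the four halt checks (parameter names and the
DROP-rate threshold are assumptions; the gap threshold and call
budget are the values these messages mention):

    # Illustrative only; the real logic lives in iterate_coverage_loop.py.
    def should_halt(top_gap, drop_rate, calls_used, top_cell,
                    prev_top_cell, gap_threshold=0.8,
                    drop_threshold=0.5, max_calls=80):
        if top_gap < gap_threshold:
            return "gap below threshold"        # (a) corpus balanced enough
        if drop_rate > drop_threshold:
            return "saturation/hallucination"   # (b) judge DROPs too much
        if calls_used >= max_calls:
            return "budget exhausted"           # (c) API call cap
        if top_cell == prev_top_cell:
            return "structural convergence"     # (d) same cell on top twice
        return None  # keep looping
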

Plus 25 round-1 visual questions generated by the new batched generator
(5 batched calls × 5 cells each, zero failures).

The loop is the answer to "we need balance, not just volume": every
iteration's plan derives from a fresh analysis of where coverage is
weakest, so generation can never over-fill an already-saturated cell.
2026-04-25 09:18:32 -04:00