8 Commits

Vijay Janapa Reddi
2b381bb949 refactor(vault-cli): rename --legacy-json to --local-json
The flag is the StaffML frontend's local-dev fallback (read corpus.json
from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path.
"Legacy" implied "soon to be removed"; "local-json" describes its actual
role and reads correctly in scripts and docs.

- vault-cli: rename CLI flag, parameter, result key, and help text.
- CI workflows + pre-commit config: invoke the new flag name.
- All scripts that print the command (suggest_exemplars,
  pre_commit_corpus_guard, promote_validated, rename_legacy_ids,
  export_to_staffml, the paper analyze_corpus/generate_*) updated.
- Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING,
  MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend
  comments and .env.example / .gitignore) updated.

The "legacy_json" sentinel string in corpus_stats.json._meta.source
is intentionally NOT renamed — it is a stable artifact format read
by downstream paper-generation tooling.
2026-04-30 09:30:28 -04:00
Vijay Janapa Reddi
542aaf95d2 cleanup(vault): release-ready Phase A — schema hardening + lint calibration + chain repair
Closes the cleanup arc (A.1–A.10 in RESUME_PLAN_RELEASE.md). Every
gate is now green: vault check --strict, vault lint, vault doctor,
vault codegen --check, staffml validate-vault, Playwright (9/9), tsc.

A.1 mobile-1962.svg: renamed `Edge` → `RegEdge` in graphviz source
    (`Edge` is a reserved keyword); SVG renders cleanly. Also fixed
    tinyml-1570.py (missing `import numpy as np`) which the new failure
    log surfaced.

A.2 render_visuals.py: structured per-ID failure log written to
    `_validation_results/render_failures.json` on every run; non-zero
    exit on any per-item crash; new `--fail-fast` and `--failure-log`
    CLI options. Replaces the prior silent-failure mode.

A.3 LinkML visual schema: typed as a structured sub-schema. New
    `VisualKind` enum (svg only — `mermaid` was reserved but never
    shipped, dropped to keep the enum honest). Path regex tightened
    to `^[a-z0-9-]+\.svg$`. Alt minimum length 10, caption required
    minimum length 5. TypeScript Visual interface + Question.visual
    field added to staffml-vault-types/index.ts.

A.4 Pydantic Visual + Question validators:
    - Visual.kind hard-rejects anything but `svg`
    - Visual.path enforces the new regex
    - Visual.alt min 10 chars, caption required min 5 chars
    - Question.model_validator: visual.path MUST resolve to a real
      file under interviews/vault/visuals/<track>/. Skipped in
      production deploys where the working tree is absent.
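A minimal sketch of those field checks, written as a plain function rather than the repo's actual Pydantic models (names and error messages here are illustrative; only the rules themselves come from A.3/A.4):

```python
import re

# Path regex from A.3: lowercase slug, .svg extension only.
SVG_PATH_RE = re.compile(r"^[a-z0-9-]+\.svg$")

def validate_visual(kind, path, alt, caption):
    """Mirror the A.4 rules: svg-only kind, tight path regex,
    alt >= 10 chars, caption required with >= 5 chars."""
    errors = []
    if kind != "svg":
        errors.append(f"kind must be 'svg', got {kind!r}")
    if not SVG_PATH_RE.match(path):
        errors.append(f"path {path!r} does not match ^[a-z0-9-]+\\.svg$")
    if len(alt) < 10:
        errors.append("alt must be at least 10 characters")
    if caption is None or len(caption) < 5:
        errors.append("caption is required, minimum 5 characters")
    return errors
```

A clean visual yields an empty list; a draft with a non-svg kind, an uppercase or underscore path, a short alt, and a missing caption trips all four rules.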

A.5 Registry repair + doctor split:
    - tools: repair_registry.py appended 5,269 missing IDs
      (the rename refactor at 8a5c3ff3c left the append-only registry
      unsynced; this brings disk-coverage to 100%). Header block in
      id-registry.yaml documents the rebuild rationale.
    - doctor.py: split symmetric `registry-integrity` check into
      `disk-coverage` (HARD FAIL if any disk YAML id is unregistered)
      and `registry-history` (INFO ONLY for retired ids — the registry
      is by design an audit log, retired ids are normal). Pre-existing
      `_check_schema_version` bug (`versions == {1}` vs string `"1.0"`)
      fixed.
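The disk-coverage / registry-history split reduces to set arithmetic over the two ID populations; a sketch of that shape (function and key names are illustrative, not doctor.py's actual API):

```python
def doctor_registry_checks(disk_ids, registry_ids):
    """A.5 split: disk-coverage is a hard gate (every YAML on disk
    must be registered); registry-history is informational, since the
    append-only registry legitimately retains retired ids."""
    unregistered = disk_ids - registry_ids   # on disk, missing from registry
    retired = registry_ids - disk_ids        # registered, no longer on disk
    return {
        "disk-coverage": {
            "status": "FAIL" if unregistered else "PASS",
            "ids": sorted(unregistered),
        },
        "registry-history": {"status": "INFO", "ids": sorted(retired)},
    }
```

Any unregistered disk ID fails the run; retired IDs only annotate it.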

A.6 Lint calibration via 4-expert consensus + bloom-canonical
    reclassification:
    - Spawned 4 experts (Vijay Reddi, Chip Huyen, Jeff Dean,
      education-reviewer) on 42 disputed (zone, level) pairs;
      consensus-builder aggregated to 15 valid / 19 invalid / 8
      borderline.
    - User arbitrated 8 borderlines: 7 widen / 1 reclassify.
    - Built ZONE_BLOOM_AFFINITY matrix (Education-Reviewer's idea):
      every zone admits its dominant Bloom verb + adjacent verbs,
      rejects clear hierarchy violations.
    - reclassify_zone_bloom_mismatch.py applied 576 deterministic
      zone fixes via BLOOM_CANONICAL_ZONE mapping (e.g. fluency+analyze
      → analyze, recall+analyze → analyze, evaluation+apply → implement).
    - Question.model_validator(_zone_bloom_compatible): hard-rejects
      future zone-bloom mismatches at write time. Generated drafts
      can no longer ship a self-contradicting classification.
    - ZONE_LEVEL_AFFINITY widened per consensus + arbitration +
      post-reclassification adjustments. Lint warnings: 1,308 → 0.
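The deterministic fix step amounts to a lookup table keyed on (zone, bloom). An illustrative fragment — only the three pairs quoted above are from the commit; the full table lives in reclassify_zone_bloom_mismatch.py:

```python
# Illustrative subset of the mapping; real table is much larger.
BLOOM_CANONICAL_ZONE = {
    ("fluency", "analyze"): "analyze",
    ("recall", "analyze"): "analyze",
    ("evaluation", "apply"): "implement",
}

def canonical_zone(zone, bloom):
    """Return the corrected zone for a known (zone, bloom) mismatch,
    or the original zone when the pair is not in the table."""
    return BLOOM_CANONICAL_ZONE.get((zone, bloom), zone)
```

Pairs outside the table pass through unchanged, which is what makes the 576-item rewrite safe to re-run.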

A.7 Chain integrity:
    - repair_chains.py: drops chain refs when a chain has <2 published
      members (chain ceases to exist), renumbers all members of any
      chain whose positions are non-sequential / duplicated /
      non-monotonic-by-level. Sort key: level ascending, then old
      position, then qid (deterministic).
    - validate-vault.py: relaxed sequential check to unique-positions
      check. Position gaps from mid-chain deletions are normal; what
      matters is uniqueness + bloom-monotonicity (vault check --strict
      enforces both from YAML source-of-truth).
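The drop-and-renumber rule above can be sketched in a few lines (member shape and field names are illustrative; the sort key is the one stated in the commit):

```python
def repair_chain(members):
    """members: dicts with qid, level, position. Chains with fewer
    than 2 published members are dropped entirely; otherwise members
    are renumbered 1..N by the deterministic key
    (level ascending, old position, qid)."""
    if len(members) < 2:
        return []  # chain ceases to exist; callers drop the refs
    ordered = sorted(members, key=lambda m: (m["level"], m["position"], m["qid"]))
    return [{**m, "position": i} for i, m in enumerate(ordered, start=1)]
```

Duplicate or gapped positions come out contiguous, and level-monotonicity is restored by the primary sort key.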

A.8 Practice page visual + zoom modal:
    - QuestionVisual.tsx: wraps the `<img>` in `<Zoom>` from
      react-medium-image-zoom (4 KB). Click image → fullscreen
      `<dialog data-rmiz-modal>`; ESC closes. Added test-id
      `question-visual-img` for stable selector.
    - New Playwright test: 9th in the suite, deep-links cloud-4492,
      asserts the dialog opens on click and closes on ESC.
    - TypeScript: removed `mermaid` from local Visual types in
      corpus.ts and corpus-vault.ts; tsc clean.

A.9 All gates green:
    - vault check --strict: 0 errors / 0 invariant failures
    - vault lint: 0 errors / 0 warnings (was 1,308 warnings)
    - vault codegen --check: artifacts in sync (hash baseline updated)
    - vault doctor: 0 fails (registry-history is info-only; git-state
      warns about the not-yet-committed working tree for this commit)
    - staffml validate-vault: 0 errors / 0 warnings, deployment-ready
    - Playwright: 9/9 pass (was 8; +zoom modal test)
    - render_visuals: 0 errors (was 2 silent failures pre-A.2)
    - tsc: clean

Distribution after reclassification: 9,544 published unchanged;
576 items moved zone via bloom-canonical mapping (full per-item
report at /tmp/reclassify_changes.csv). Chain count 879 → 850
after orphan-singleton drops. release_hash updated.

Carry-forward to next session (Phase B):
- Priority gap closure for parallelism cells + global L4-L6+
  (the run that produced this corpus did not close the targeted
  cells; B.3 needs specialized prompts per cell-class)
- 120 NEEDS_FIX items from coverage_loop/20260425_150712/ still
  carry judge fix_suggestions; spawn fix-agent in Phase C
2026-04-25 15:12:51 -04:00
Vijay Janapa Reddi
ece6eccf23 feat(vault): massive build — 630 drafts generated, 320 PASS promoted, paper 0.1.1
Phase 1 (analyzer):  top-priority cells: tinyml/parallelism (0/90),
                     tinyml/networking (2/90), mobile/parallelism (0/127),
                     edge/parallelism (12/152), global/L4-L6+ deeply empty.
Phase 2 (loop):      6 iterations, 50 of 80 API calls used, 630 drafts
                     generated (52% PASS / 19% NEEDS_FIX / 26% DROP /
                     ~6% unjudged). Saturation reason: same top-priority
                     cell two iterations in a row — converged. Top-priority
                     decay 2.25 → 2.14 → 2.03 → 1.93 → 1.83 plateaued;
                     generator cannot meaningfully shrink
                     tinyml/specification/L6+ further within current
                     prompt framing. Both halt conditions (gap-threshold
                     0.8, max-calls 80) had headroom; structural
                     convergence fired first. Loop defaults bumped:
                     max-iters 20 → 30, max-calls 60 → 80, batch 12 → 30,
                     calls/iter 3 → 4, judge chunk 15 → 25.
Phase 3 (quality):   Spot-read 4 PASS items + visuals across cloud/edge/
                     mobile/tinyml. All technically sound, math correct,
                     real hardware grounding (MI300X, Jetson Orin,
                     Cortex-M4 BLE), SVGs follow svg-style.md palette.
                     Systemic finding: generator emitted 462 drafts with
                     malformed competency_area values (60 distinct
                     patterns: zones-as-area, bloom-verbs-as-area,
                     underscore hallucinations, dash-form/slash-form
                     concatenations). Resolved by extending
                     fix_competency_areas.py REMAP table; re-run cleanup
                     mapped all 462 to canonical. Root cause —
                     generator skips Pydantic validation at write time —
                     flagged for follow-on fix; not blocking.
Phase 4 (promote):   320 PASS items promoted; bundle 9,224 → 9,544
                     published (exactly +320). Visual assets: 234 in
                     bundle, mirrored to staffml/public/.
Phase 5 (paper):     Cut 0.1.1 release (patch bump: content addition,
                     no schema change). release_hash 0350da5706e6.
                     macros.tex regenerated to 9,544/87 topics/
                     13 areas/11 zones; 4 figures rebuilt; paper.tex
                     zone counts updated (1,583/1,227/1,113 →
                     1,615/1,256/1,144). PDF compiles to 25 pages,
                     no LaTeX errors (citation warnings pre-existing).
Phase 6 (GUI):       All 8 Playwright tests pass on fresh dev server.
                     /practice HTML contains zero malformed area names
                     (down from 60 distinct pre-fix).
Phase 7 (manifest):  vault-manifest.json refreshed: questionCount
                     9224 → 9544, contentHash 539eb877f9cc → 0350da5706e6,
                     track + level distributions updated to match
                     0.1.1 corpus.

Loop run dir: interviews/vault/_validation_results/coverage_loop/20260425_150712
Deferred queue (next session): 120 NEEDS_FIX items carrying judge
fix_suggestions + 165 DROP items, plus the generator validate-at-write fix.

The runbook (vault/docs/MASSIVE_BUILD_RUNBOOK.md) is the methodology
this session followed; can be re-run on any future generation day.
2026-04-25 13:15:41 -04:00
Vijay Janapa Reddi
0afc384282 feat(vault): LLM-as-judge validator + iterative coverage loop
Two new pieces close the generation→validation→saturation feedback loop:

1. gemini_cli_llm_judge.py — multi-criteria validator. For each draft,
   judges math correctness, cell-fit (does it actually target the
   declared track/zone/level?), scenario realism, uniqueness vs canonical
   questions, and visual-asset alignment. Returns PASS/NEEDS_FIX/DROP
   per item. Batched (default 15 per call) for budget efficiency.
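One plausible roll-up from per-criterion results to the three verdicts — the aggregation rule below is an assumption for illustration, not the judge script's actual logic:

```python
def verdict(criteria):
    """criteria: dict of criterion name -> 'pass' | 'fix' | 'fail'.
    Assumed roll-up: any hard failure drops the draft; any fixable
    issue demotes it to NEEDS_FIX; otherwise it passes."""
    results = set(criteria.values())
    if "fail" in results:
        return "DROP"
    if "fix" in results:
        return "NEEDS_FIX"
    return "PASS"
```

Under this rule a single failed criterion (e.g. wrong math) outweighs any number of passes, matching the PASS/NEEDS_FIX/DROP split reported per item.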

2. iterate_coverage_loop.py — drives the full loop:
   analyze → plan → generate → render → judge → apply → re-analyze.
   Self-paced: stops when (a) top priority gap drops below threshold,
   (b) DROP rate exceeds the saturation/hallucination threshold,
   (c) total API calls exceed budget, or (d) the same cell is top
   priority for two iterations in a row (convergence). The user no
   longer specifies "how many questions" — the loop generates until
   the corpus reaches a measurable steady state.
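The four stop conditions can be sketched as a single predicate; the function name, signature, and default thresholds below are illustrative (the 0.8 gap threshold and 80-call budget are the defaults quoted elsewhere in this log):

```python
def should_halt(gap_history, drop_rate, calls_used,
                gap_threshold=0.8, drop_threshold=0.5, max_calls=80):
    """gap_history: list of (top_cell, gap_score) per iteration.
    Returns the halt reason, or None to keep iterating."""
    if gap_history and gap_history[-1][1] < gap_threshold:
        return "gap-closed"        # (a) top priority gap below threshold
    if drop_rate > drop_threshold:
        return "saturation"        # (b) judge DROP rate too high
    if calls_used >= max_calls:
        return "budget"            # (c) API-call budget exhausted
    if len(gap_history) >= 2 and gap_history[-1][0] == gap_history[-2][0]:
        return "convergence"       # (d) same cell on top twice in a row
    return None
```

In the massive-build run described above, conditions (a)-(c) all had headroom and (d) fired first.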

Plus 25 round-1 visual questions generated by the new batched generator
(5 batched calls × 5 cells each, zero failures).

The loop is the answer to "we need balance, not just volume": every
iteration's plan derives from a fresh analysis of where coverage is
weakest, so generation can never over-fill an already-saturated cell.
2026-04-25 09:18:32 -04:00
Vijay Janapa Reddi
612885a952 refactor(vault): visual schema aligns with website + 5 more Gemini-generated visuals
Schema fix: visual.kind is always 'svg' (the format the website ships) and
visual.path points to that asset. The build-pipeline format is recorded as
optional metadata in visual.source_format ('dot' | 'matplotlib' | 'hand'),
which the website ignores. This separates "what users render" from "how
maintainers built it".

Source files live next to the SVG by naming convention; the renderer infers
the path from the YAML's source_format hint without a dedicated source field.
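The naming-convention inference might look like the following; the commit only states that sources sit next to the SVG, so the extension mapping here is an assumption for illustration:

```python
from pathlib import PurePosixPath

# Assumed extensions per source_format; 'hand' SVGs have no source file.
SOURCE_EXT = {"dot": ".dot", "matplotlib": ".py"}

def infer_source_path(svg_path, source_format):
    """Derive the sibling source path from the shipped SVG path
    and the optional source_format hint."""
    ext = SOURCE_EXT.get(source_format)
    if ext is None:
        return None
    return str(PurePosixPath(svg_path).with_suffix(ext))
```

Because the website only reads visual.path, this convention stays invisible to users while keeping the build reproducible.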

Five new visual exemplars generated by Gemini 3.1 Pro Preview, covering
diverse archetypes:
- cloud-2849 (DOT): incast-bottleneck topology
- cloud-2850 (DOT): leaf-spine fabric with 2:1 oversubscription
- cloud-2851 (matplotlib): bandwidth bar chart for data pipeline diagnosis
- cloud-2852 (matplotlib): checkpoint/recovery timeline with RPO/RTO
- edge-0972 (matplotlib): Poisson vs bursty queueing curves

Plus the four prior exemplars (cloud-2846, 2847, 2848, tinyml-0816)
re-emitted under the new schema. cloud-visual-001 unchanged — already had
the correct shape.

ARCHITECTURE.md rewritten to document the simpler three-layer separation
(website / build / authoring).
2026-04-25 08:57:26 -04:00
Vijay Janapa Reddi
f435185671 feat(vault): Gemini 3.1 Pro question generator with optional visual archetypes
gemini_cli_generate_questions.py mirrors gemini_cli_math_review.py's design:
review-first, JSON-strict, model pinned to gemini-3.1-pro-preview with a hard
guard against override. Targets weak coverage cells from the portfolio
balance loop or explicit --target track:topic:zone:level cells.

For visual-eligible topics (the 10 archetypes in audit_visual_questions.py),
the generator also produces the diagram source artifact (DOT or matplotlib
script) which render_visuals.py converts to a ship-ready SVG. This closes
the generation→render→validate loop using two different model passes:
Gemini drafts; the math review verifies.

First generated example: tinyml-0816 (wake-word duty-cycle evaluation) with
a matplotlib power-timeline visual. Math review returned CORRECT on the
first call. Status remains draft pending broader cross-validation.
2026-04-25 08:47:41 -04:00
Vijay Janapa Reddi
38e5c99f17 feat(vault): multi-format visual question architecture (DOT + matplotlib + SVG)
ARCHITECTURE.md establishes that visuals are a property of any question, not
a separate category. Three supported formats let the layout engine do the
work: DOT for graph topology, matplotlib for curves and Gantt charts, hand
SVG for custom layouts.

render_visuals.py is the single entry point that dispatches by visual.kind,
runs the appropriate tool, and normalizes the rendered SVG to the book's
font stack. It is idempotent and supports --dry-run.
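A sketch of that dispatch-by-kind entry point, with hypothetical stand-ins for the real tool invocations (the command strings and dict shape here are illustrative only):

```python
def render_visual(visual, dry_run=False):
    """Dispatch a visual to its renderer by kind and return the
    command that would run; a real implementation would execute it
    and then normalize the SVG's font stack."""
    renderers = {
        "dot": lambda v: f"dot -Tsvg {v['source']}",
        "matplotlib": lambda v: f"python {v['source']}",
        "svg": lambda v: f"copy {v['source']}",  # hand SVG passes through
    }
    try:
        command = renderers[visual["kind"]](visual)
    except KeyError:
        raise ValueError(f"unsupported visual kind: {visual['kind']!r}")
    return f"[dry-run] {command}" if dry_run else command
```

Keeping the dispatch table pure (it only builds commands) is what makes --dry-run and idempotent re-runs cheap to support.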

Three exemplars cover the three formats:
- cloud-2846 (DOT): Tree AllReduce on 8 ranks — auto-laid-out topology
- cloud-2847 (matplotlib): Queueing hockey-stick curve with SLO line
- cloud-2848 (matplotlib): Pipeline-bubble Gantt for GPipe schedule

All three are status:draft pending math review and promotion in a later
batch. Existing cloud-visual-001 remains unchanged as the canonical
hand-SVG exemplar.
2026-04-25 08:42:59 -04:00
Vijay Janapa Reddi
1898fe8c9a feat(vault): add first visual-question exemplar + authoring guide
Seeds the visuals/ directory with a reference pattern so future
authors have a concrete template to clone.

Exemplar: Ring AllReduce on 4 ranks (cloud track, L3, apply/analyze).
- SVG follows .claude/rules/svg-style.md: 680×460 viewBox, Helvetica
  Neue, compute-blue ranks, orthogonal ring arrows, 10-px grid.
- YAML wires the visual block (kind=svg, path=cloud-visual-001.svg,
  alt + caption) and pairs it with a matching question: 'Using the
  diagram, calculate the total time to complete the full AllReduce.'
- The realistic_solution walks through 2(N−1)/N × data / bw and
  explains the common failure mode (forgetting the all-gather phase).
  Napkin math shows the step-time decomposition.
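That napkin math is easy to reproduce; the formula is the one in the exemplar, while the concrete numbers below are illustrative:

```python
def ring_allreduce_time(n_ranks, data_bytes, bw_bytes_per_s):
    """Bandwidth-bound Ring AllReduce: each rank moves 2(N-1)/N of the
    data over the wire (reduce-scatter phase plus all-gather phase —
    forgetting the all-gather is the common failure mode)."""
    bytes_on_wire = 2 * (n_ranks - 1) / n_ranks * data_bytes
    return bytes_on_wire / bw_bytes_per_s

# Example: 4 ranks, 4 GB of gradients, 12 GB/s links
# -> 2 * 3/4 * 4e9 / 12e9 = 0.5 s
t = ring_allreduce_time(4, 4e9, 12e9)
```

Dropping the all-gather phase (using (N−1)/N instead of 2(N−1)/N) would halve the answer, which is exactly the mistake the realistic_solution warns about.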

AUTHORING.md: the when/how/why guide for future visual questions.
- When a visual earns its place — three criteria (ask requires the
  diagram, encodes info text cannot, static suffices).
- High-value candidate topics — ring/tree AllReduce, roofline, KV
  cache, pipeline bubbles, memory hierarchy, MCU memory maps,
  systolic arrays, attention, MoE.
- Step-by-step authoring workflow pointing at the book's SVG style
  guide; readers already know the visual vocabulary from the book,
  so consistency transfers.
- Accessibility requirements (non-negotiable): alt is enforced by
  the Pydantic schema, colour never the sole semantic channel, text
  in <text> elements (not paths), WCAG AA contrast.
- Explicit anti-patterns: no inline SVG in YAML, no mermaid for
  non-graph content, no decorative effects, no label duplication of
  scenario prose.
2026-04-24 16:10:54 -04:00