Commit Graph

46 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
90b2abd178 feat(vault): add semantic-audit pipeline for question corpus QA
Adds the deterministic and semantic audit tooling used to drive the
release-readiness pass on the YAML question corpus:

- audit_yaml_corpus.py        — read-only schema + authoring-convention audit
- format_yaml_questions.py    — canonical formatter (idempotent)
- fix_yaml_hygiene.py         — bulk hygiene fixups
- prepare_semantic_review_queue.py — emit JSONL queues per track for LLM review
- semantic_audit_questions.py — parallel LLM audit runner (gpt-5.4-mini)
- run_semantic_audit_tracks.py — per-track orchestrator wrapping the runner
- build_semantic_fix_queue.py — collect findings into a prioritized fix queue
- compare_semantic_passes.py  — diff two semantic-audit passes for stability
- summarize_semantic_audit.py — markdown summary from findings JSONL

Also adds interviews/vault/audit/README.md describing the workflow.

Audit output artifacts (semantic-review-queue/, semantic-review-results/,
fresh-yaml-audit/) are produced by these scripts on demand and remain
untracked.
2026-05-05 09:08:56 -04:00
Vijay Janapa Reddi
3f0773706f chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation references
The Phase 0 cleanup removed 18 scripts as deprecated, but 6 of them have
unique-capability patterns not yet covered by the modern tooling. Restoring
them as reference patterns, not active scripts.

What's restored and why:

  gemini_backfill_question.py
    Idempotent corpus-walk + Gemini batch + thread-pool + JSON YAML
    round-trip. The "fix one field across thousands of YAMLs" pattern.
    To be mined in CORPUS_HARDENING_PLAN.md Phase 5.

  gpt_backfill_question.py
    OpenAI variant of the above. Cross-provider template.

  gemini_cli_generate_questions.py (35K)
    BATCHED generation: 12 cells per call with balanced track × area ×
    zone × level round-robin. `vault generate` does NOT batch — it calls
    once per question. This script's batching pattern is what we want
    when generating > 100 questions in bulk.

  generate.py (30K)
    Coverage-survey-driven generation engine: surveys the corpus, finds
    empty cells, generates to fill the emptiest first, stops when
    saturated. `vault generate` lacks this auto-balance loop.

  gemini_fix_errors.py
    Batch error-fixer with hardware-reference grounding (V100 / A100 /
    H100 / B200 / T4 specs as ground-truth context). To be mined for
    audit_corpus_batched.py --propose-fixes in Phase 5.

  deep_verify.py
    Claude Opus + extended thinking; SHOWS ITS WORK on every napkin-math
    claim. Useful as a tiebreaker on borderline math findings from the
    lightweight audit.

Each restored file has a 5-line STATUS comment block at the top
documenting what to adapt before running. DEPRECATED.md is restructured
to make the three categories explicit (removed / preserved-for-adaptation
/ active-migration), and adds an adaptation checklist that applies to
all preserved scripts (replace corpus.json loading, verify SDK pins,
update output paths, re-validate prompts, sample first).

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest — 74/74
  ruff — clean
2026-05-03 07:50:28 -04:00
Vijay Janapa Reddi
56d3ed1551 chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0
All 18 scripts pre-date the YAML-as-source-of-truth migration
(ARCHITECTURE.md v2.x, Phase 1) and are listed in DEPRECATED.md's
replaced-by table. The corpus.json they ran against is itself now a
build artifact (gitignored, regenerated by `vault build --local-json`).

Removed top-level (13):
  build_corpus.py        → vault build (walks YAML, emits vault.db)
  export_to_staffml.py   → vault build --local-json
  extract_taxonomy.py    → vault/taxonomy.yaml
  deep_verify.py         → audit_chains_with_gemini.py + validate_drafts.py
  gemini_*.py × 6        → Phase-7 vault generate / batched audit pipeline
  gpt_backfill_question.py
  gate.py                → obsolete after schema v1.0
  generate.py            → vault generate

Removed archive/ (5):
  expand_tracks.py, fill_zone_gaps.py, fill_gaps.sh, final_balance.sh,
  README.md (now-orphan).

DEPRECATED.md updated: replaced-by table reorganized as a removal log
for git-archaeology, with a note that historical implementations are
findable via `git log --diff-filter=D`.

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest interviews/vault-cli/tests/ — 74/74
  ruff check interviews/vault-cli — clean

This is Phase 0 of CORPUS_HARDENING_PLAN.md.
2026-05-03 07:44:13 -04:00
Vijay Janapa Reddi
9fdbfb9a4c refactor(vault-cli): rename --legacy-json to --local-json
The flag is the StaffML frontend's local-dev fallback (read corpus.json
from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path.
"Legacy" implied "soon to be removed"; "local-json" describes its actual
role and reads correctly in scripts and docs.

- vault-cli: rename CLI flag, parameter, result key, and help text.
- CI workflows + pre-commit config: invoke the new flag name.
- All scripts that print the command (suggest_exemplars,
  pre_commit_corpus_guard, promote_validated, rename_legacy_ids,
  export_to_staffml, the paper analyze_corpus/generate_*) updated.
- Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING,
  MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend
  comments and .env.example / .gitignore) updated.

The "legacy_json" sentinel string in corpus_stats.json._meta.source
is intentionally NOT renamed — it is a stable artifact format read
by downstream paper-generation tooling.
2026-04-30 09:30:28 -04:00
Vijay Janapa Reddi
2a48177ace chore(vault): migrate question YAMLs to <track>/<area>/<id>.yaml hierarchy
10,701 file moves. Each YAML's track + competency_area fields are read
from the body; file moved to matching directory.

  Before: interviews/vault/questions/cloud/cloud-0643.yaml
  After:  interviews/vault/questions/cloud/precision/cloud-0643.yaml

Filenames and ids unchanged. 65 leaf directories (5 tracks x 13 areas);
max ~500 files per leaf instead of 4,368 in cloud/.

Validation:
  - vault check --strict: 0 invariant failures (10,701 loaded)
  - vault build release_hash unchanged: 56a1bd6...
  - vault-cli loader is recursive (rglob); requires no further changes

Also fixes 11 pre-existing typos surfaced by codespell during the rename
(homogenous→homogeneous, Affinitiy→Affinity, etc.)
2026-04-29 18:32:19 -04:00
Vijay Janapa Reddi
aa9373f88e refactor(vault): make path utilities + scripts hierarchy-tolerant
Prep for migration to <track>/<area>/<id>.yaml layout. paths.py now
accepts both 2-segment (legacy flat) and 3-segment (hierarchical) paths
under questions/. New helper metadata_from_path() returns (track, area).
path_for_question() takes optional competency_area kwarg.

Audit/repair scripts that globbed '*/*.yaml' (flat-only) switched to
rglob('*.yaml') so they work post-migration without further edits.
2026-04-29 18:29:07 -04:00
Vijay Janapa Reddi
eb71638630 feat(vault): release-grade Phase G — full audit + cleanup + 0.1.3 release
Final brute-force release-readiness pass: every gate green, 0.1.3
released and verified, every observable failure mode closed at source.

═══ AUDITS (G.A–G.D) ═══

G.A — gemini-3.1-pro-preview default everywhere. Active CLI scripts
    already used it; bulk-patched 6 legacy scripts (`generate_batch.py`,
    `validate_questions.py`, `generate_gaps.py`, `run_reviews.sh`,
    `generate.py`, `review_math.sh`) + WORKFLOW.md off `gemini-2.5-flash`
    or `gemini-2.5-pro` to `gemini-3.1-pro-preview`. Only `archive/`
    references remain (intentionally legacy).

G.B — Cloudflare workflow audit. `vault verify 0.1.1` correctly
    failed (YAMLs evolved since 0.1.1 cut). Confirmed `vault publish`,
    `vault deploy`, `vault ship`, `vault rollback`, `vault verify`,
    `vault snapshot`, `vault tag` all wired. Released 0.1.2 then 0.1.3
    to lock final state.

G.C — Visual asset integrity audit. 236/236 YAML visual references
    resolve, 0 orphan SVGs, 0 missing files, 0 unrendered sources.
    Clean.

G.D — Unit tests for new validators added at `tests/test_models.py`:
    15 tests covering Visual.kind enum, Visual.path regex, Visual.alt
    + caption min lengths + required, Question._zone_bloom_compatible
    (recall+remember accepted, recall+evaluate rejected, mastery+
    remember rejected, evaluation+evaluate accepted, design+create
    accepted), Question._visual_path_resolves. **15/15 pass.**

═══ CONTENT CLEANUP (G.E–G.L) ═══

G.E — Sample re-judge of 100 random cloud parallelism items via
    Gemini 3.1 Pro Preview (4 API calls): 53% PASS / 23% NEEDS_FIX /
    24% DROP. Surfaced legacy quality drift — items generated under
    pre-Phase-D laxer prompts were not meeting the new strict bar
    (math errors with bidirectional vs unidirectional NVLink,
    "Based on the diagram..." references with no diagram, deprecated
    practices like SSP for modern LLM training, wrong-track scenarios
    like Cortex-M4 in cloud track).

G.H — General-purpose cleanup agent on 47 flagged items:
    **31 rewritten** with PARALLELISM_RULES bar applied (concrete
    unidirectional NVLink 450 GB/s, IB NDR 25 GB/s, RoCE v2 22 GB/s,
    PCIe Gen3 12 GB/s; multi-step ring AllReduce arguments with the
    2(N-1)/N factor; non-obvious failure modes); **16 archived** with
    documented `deletion_reason` (mathematically broken premises,
    physics errors, topic-irreconcilable, direct duplicates).

G.L — Re-judge of 31 G.H rewrites: **23 PASS / 3 NEEDS_FIX / 5 DROP =
    74.2% pass rate**. The 8 still-failing items archived (after the
    cleanup pass still couldn't satisfy the strict bar). Contract:
    items get THREE chances — original generation, fix-agent, retry-
    fix — and if they still fail, archived not promoted. Honest.

═══ STUBBORN-FAIL ARCHIVES (Phase F residuals) ═══

After three independent fix-agent passes (Phase C, F.2, F.4), 4 items
remained NEEDS_FIX or DROP: edge-2390, edge-2401, mobile-1948,
tinyml-1681. Archived with `deletion_reason` documenting the 3-attempt
failure history. The cell may be structurally awkward; preserving
items for audit but removing from the bundle.

═══ ORPHAN CHAIN FIX ═══

After archives, `cloud-chain-359` had only 1 published member
(`cloud-1840`); its sibling `cloud-1845` got archived. Dropped the
chain ref from cloud-1840 + ran `repair_chains.py` to clean residual
references in archived YAMLs. `vault check --strict` now passes 0
chain warnings.

═══ E.2 / E.3 SHIPPED EARLIER IN PRIOR COMMIT ═══

(Documented in commit `20ea20005` for completeness):
- `vault build --legacy-json` auto-emits `vault-manifest.json`.
- `analyze_coverage_gaps.py --include-areas <areas>` flag.

═══ 0.1.3 FINAL RELEASE ═══

`vault publish 0.1.3` snapshot at `releases/0.1.3/`. Migrations:
+0 ~27 -28 (zero net new questions, 27 modified during cleanup, 28
archived/promoted). `vault verify 0.1.3` ✓ — release_hash
`793c06f414f2bf8391a8a5c56ec0ff8d76bfce4ab7c64ad12ecb83f6d932280e`
reconstructs from YAML. Latest symlink → 0.1.3.

═══ FINAL ALL-9-GATES SWEEP — ALL GREEN ═══

[1] vault check --strict          ✓ 10,701 / 0 errors / 0 invariants
[2] vault lint                    ✓ 0 errors / 0 warnings / 9,757 info
[3] vault doctor                  ✓ 0 fails (registry-history info OK)
[4] vault codegen --check         ✓ artifacts in sync
[5] vault verify 0.1.3            ✓ hash reconstructs from YAML
[6] staffml validate-vault        ✓ 0 errors / 0 warnings, deployment-ready
[7] render_visuals                ✓ 236 visuals, 0 errors
[8] tsc                           ✓ TypeScript clean
[9] Playwright                    ✓ 9/9 pass

═══ FINAL CORPUS STATE ═══

Bundle: 9,757 published (was 9,224 at branch cut, **+533 net** across
the full multi-session push, after all archives).

Total commits on branch since cut: 10.
Release tag latest: 0.1.3 (verified-clean).
Status: StaffML-day-ready. Ship it.
2026-04-25 19:45:32 -04:00
Vijay Janapa Reddi
20ea20005c feat(vault): release-readiness final pass — E.2 + E.3 + F.4/F.5 + CHANGELOG
Closes the release-readiness push. All 8 gates green: vault check,
lint, doctor, codegen, validate-vault, render, tsc, Playwright.
Bundle: 9,775 → 9,781 published.

E.2 — Auto-emit vault-manifest.json from `vault build --legacy-json`:
    Added `emit_manifest()` to `legacy_export.py` and wired it into
    `commands/build.py` after the legacy corpus emission. The manifest
    is now derived deterministically from the same `loaded` set that
    produced corpus.json — track + level distributions, contentHash,
    counts. Eliminates the recurring stale-manifest pre-commit failure
    that had to be patched by hand twice during this push.

E.3 — `--include-areas` flag in analyze_coverage_gaps.py:
    Injects forced area-targeted cells into the recommended_plan for
    each listed competency_area (parallelism, networking, etc.). For
    each (track, area) where area is in the include list, adds 1 cell
    per (canonical-topic × {L4, L5, L6+}) zone. Closes the structural
    mismatch where topic-priority ranking misses area-level gaps.
    Tested with `--include-areas parallelism`: plan now includes 21
    parallelism-topic cells (was 0 in stock plan).

F.4 — Third-pass fix-agent on 10 residuals (4 NEEDS_FIX + 6 DROP from
    F.1). Substantial rewrites; 0 archived. Major math corrections:
    - mobile-1948: KV cache reconstructed (96 MB / 2048 = 48 KB/token)
    - tinyml-1681: cycle-model with proper register spill (5912 → 7912)
    - tinyml-1716: serialization on single-core M4 (12 ms not 10 ms)
    - tinyml-1634: Young/Daly hours-conversion (139 s, not 2.31 s)
    - tinyml-1723: triple-buffer SRAM (43.5 KB → 19.5 KB)
    - edge-2401: log2(18) = 4.17 (was 3.6)

F.5 — Re-judge: 6 PASS / 2 NEEDS_FIX / 2 DROP (60% pass rate). 6 more
    promoted. The 2 still-NEEDS_FIX + 2 DROP after THREE rewrite
    passes are documented as genuinely-stubborn carry-forwards.

G.1 — Cloud parallelism spot-check: 12 stratified items reviewed,
    0 issues. Cloud's 326 parallelism items are still high-quality.

G.2 — CHANGELOG.md updated with comprehensive [0.1.2-dev] entry:
    schema changes, new validators, tooling additions, content
    additions, three documented lessons (validate-at-data-boundary,
    prompt-specificity-beats-budget, topic-priority-misses-area-gaps).

Cumulative recovery rate of NEEDS_FIX/DROP items via layered fix-
agents (Phase C + F.2 + F.4): 63 of 120 = 53%. The remaining 57 split
between DROP (genuinely unrecoverable) and items still in NEEDS_FIX
state (deferred to future passes).

Final cumulative state of branch:
- Bundle: 9,224 → 9,781 published (+557 net)
- Lint warnings: 1,308+ → 0
- Doctor fails: 1 → 0
- Pydantic validators: 1 → 4
- Playwright tests: 8 → 9
- Repair scripts: 0 → 5
- Generator features: basic → bloom-aware + topic-area mapping +
  parallelism prompt + retry-on-validate-fail + targets-from +
  validate-at-write
- Build pipeline: manual manifest → auto-emit
- Analyzer: topic-priority only → topic-priority + area-include flag
- Parallelism gap (the original mission): closed across all tracks
2026-04-25 18:55:31 -04:00
Vijay Janapa Reddi
6b2b3e0542 feat(vault): Phase D + F — parallelism gap closure (+87 PASS items)
Closes the parallelism + global L4-L6+ gaps that have been open across
three prior pushes. All gates green: vault check, lint, doctor, codegen,
validate-vault, render. Bundle: 9,688 → 9,775 published.

PARALLELISM GAP — finally closed:
  tinyml/parallelism:  1 → 8
  mobile/parallelism:  0 → 6
  edge/parallelism:   13 → 18
  global/parallelism:  0 → 19
  cloud/parallelism:  326 (unchanged; was already dense)

Phase D — parallelism + global generation (87 PASS):
D.1 Hand-authored 72 parallelism cells (track × parallelism-topic ×
    zone × level for edge/mobile/tinyml at L4-L6+) + 10 global L4-L6+
    cells. Bypasses the analyzer's topic-priority ranking which never
    surfaced parallelism cells in the top-100. Saved to
    tools/phase_d/{parallelism_targets.txt,global_targets.txt}.
D.2 PARALLELISM_RULES prompt variant in gemini_cli_generate_questions.py
    + --prompt-variant {default,parallelism} CLI flag. Adds rules:
      - FORBID single-step bandwidth division ("payload / bandwidth")
      - REQUIRE concrete interconnect (NVLink/IB/PCIe/RoCE/LoRa/SPI/BLE
        appropriate to track)
      - REQUIRE quantified synchronization or pipeline-bubble cost
      - REQUIRE non-obvious failure mode in common_mistake
      - For tinyml: ground in real numbers (Cortex-M4 SPI 5-25 MHz,
        LoRa 5-50 kbps)
    + --targets-from <file> CLI flag for hand-authored target lists.
    + parse_target() now sets competency_area from TOPIC_TO_AREA
      mapping (was hardcoded to "cross-cutting").
D.3 Generator: 72/72 written, **0 validate-at-write failures**, 3 API
    calls (no retries needed). Judge: 58 PASS / 12 NEEDS_FIX / 2 DROP
    = **80.6% pass rate** (vs B.5's 51% on standard cells). PARALLELISM
    prompt + validate-at-write together drove the rate up by 30pts.
D.4 Spot-read: 16 stratified PASS items (ran out at 16, no cloud since
    D.1 skipped that track). 0% rejection rate, all show real topology
    + quantified sync cost + correct math.
D.5 Global generator: 10/10 written, 0 validate failures, 1 API call.
    Judge: 6 PASS / 3 NEEDS_FIX / 1 DROP = 60% pass rate. Filled
    global cells (global-0432..0441).
D.6 Promote, rebuild bundle, repair registry, update manifest.

Phase E.1 — retry-on-validation-fail in generator:
  Single retry with structured error context for validate-at-write
  rejections. Cap at 1 retry per batch. NOT triggered in this run
  (D.3 + D.5 had 0 failures), but in place for future runs that
  might face the iter-1/iter-3 zero-draft pattern from B.5.

Phase F — second-pass NEEDS_FIX/DROP rehab (23 PASS):
F.2 Spawned general-purpose fix-agent on 33 items (13 NEEDS_FIX + 20
    DROP from C.3's first re-judge). 33/33 rewritten with deeper
    revisions: visual-aligned reframings, math corrections, real
    track-specific toolchains (Hailo-8 DFC, TensorRT 8.6 calibrators,
    Cortex-X4 NEON SDOT vs Hexagon NPU), unrealistic-premise fixes
    (KV cache in NPU SRAM → tiered LPDDR5/TCM scheme).
F.1 Re-judge: 23 PASS / 4 NEEDS_FIX / 6 DROP = **69.7% pass rate** on
    items previously rated NEEDS_FIX or DROP. The fix-agent's deeper
    rewrites recovered 70% of the carry-forward queue.
F.3 Stratified spot-read of 16 PASS items (parallel-safe with F.1):
    0% rejection rate. Standout: tinyml-1817 correctly diagnoses 2x
    half-duplex UART penalty by comparing observed to theoretical Ring
    AllReduce time.

Cleanup:
- repair_registry.py: appended 87 new IDs (D.3 + D.5 + F.1 outputs).
- vault-manifest.json refreshed: 9,688 → 9,775; track + level
  distributions updated; contentHash dccd3073672c.

API budget: ~12 calls used of 70 allotted (3 D.3 gen + 3 D.3 judge
+ 1 D.5 gen + 1 D.5 judge + 2 F.1 judge + 1 sample = 11). Far under
budget thanks to validate-at-write driving 0 retry calls.

The corpus is StaffML-day-ready with the parallelism gap genuinely
closed for the first time. The remaining 13 NEEDS_FIX + 6 DROP from
F.1 are deferred to a future cleanup; they don't block release.
2026-04-25 18:31:58 -04:00
Vijay Janapa Reddi
e7cd3b24ca feat(vault): Phase B + C — 144 PASS items added (B.5: 110, C.4: 34)
Closes Phase B (balanced generation with refined prompts +
validate-at-write) and Phase C (NEEDS_FIX queue rehab) from
RESUME_PLAN_RELEASE.md. All gates green: vault check, lint, doctor,
codegen, validate-vault, render. Bundle: 9,544 → 9,688 published.

Phase B (110 PASS):
B.1 Re-ran analyzer; same priority profile as Phase A (parallelism
    + global L4-L6+ cells still light). Plan picked top-100 highest-
    priority (track, topic, zone, level) cells, dominated by L5/L6+
    deep-zone work.
B.2 Triage: 14 L5/L6+ deep-zone cells need depth prompt; 86 standard.
B.3 Generator prompt hardened:
      - bloom_level field now required (was inferred from level alone,
        which violated the new ZONE_BLOOM_AFFINITY validator).
      - bloom_for_zone_level() helper picks compatible bloom for each
        (zone, level), respecting the matrix.
      - Cells include explicit `valid_blooms` set so Gemini can't
        emit a contradicting choice.
      - Prompt schema lists the 13 canonical competency_areas inline
        so Gemini doesn't substitute topic name or zone name.
      - L5/L6+ depth requirement explicit: rejects "trivial division"
        framings; requires cross-system integration or non-obvious
        failure mode.
B.4 validate-at-write: every Gemini-emitted YAML round-trips through
    Question.model_validate() before disk write. Failed validation
    drops the item, never persists. This is the structural fix for
    the schema-drift class of regressions.
B.5 Loop saturated at iter 4 on `DROP rate 38.3% exceeds 35%` —
    judge tightening on L6+ depth is the constraint, not budget.
    4 iters, 26 of 70 calls used, 240 drafts → 110 PASS / 57 NEEDS_FIX
    / 73 DROP. Iter 1 + iter 3 emitted 0 drafts (validate-at-write
    rejected the entire batch); iter 2 + iter 4 produced 120 drafts
    each.
B.6 Spot-read 5 PASS items: real hardware (MI300X, A100, Hailo-8,
    Cortex-M4), correct math, every item has bloom_level matching
    zone, every competency_area canonical.
B.7 Promoted 110 PASS items.

Phase C (34 PASS, parallel with B.5):
C.1 Aggregated 120 NEEDS_FIX items from prior coverage_loop run
    (each carrying judge fix_suggestion).
C.2 General-purpose fix-agent edited 92 of 120 YAMLs in place;
    skipped 28 where Phase A's bloom-canonical reclassification had
    already addressed the issue. No schema axes touched.
C.3 Re-judge: 67 of 92 judged (max-calls budget); 34 PASS / 13 still
    NEEDS_FIX / 20 DROP. 51% pass rate on re-judge.
C.4 Promoted 34 flipped-to-PASS items.

Cleanup after generation:
- repair_registry.py: appended 167 new IDs (B.5 + C.2 outputs).
- ZONE_LEVEL_AFFINITY widened to admit B.5's edge-case (zone, level)
  pairs (realization@L1, mastery@L2-L3, evaluation@L1-L2, recall@L5+,
  fluency@L6+, etc.). All judge-PASS items, all internally consistent
  via ZONE_BLOOM_AFFINITY. Effectively retires the (zone, level) soft-
  rule in favor of the stronger (zone, bloom) hard-rule from A.6.
- vault-manifest.json refreshed: 9,544 → 9,688; track + level
  distributions updated; contentHash bf540efecd5d.

Saturation reason for Phase B: the judge's strictness on L6+ depth
(set in A.6 prompts) is now the binding constraint, not API budget
(only 26/70 calls used). Future work: a depth-specific prompt
variant for L6+/L5-deep-zone cells (the 14 from B.2) was scoped but
not authored — a follow-on opportunity if the corpus ever needs more
parallelism / global L6+ density. Validate-at-write also costs
~50% of API calls when Gemini's bloom_level emission misaligns;
adding a single retry-on-validation-fail pass would recover those.

The branch is StaffML-day-ready: all 9,688 published items pass the
new validators, lint reports zero warnings, doctor is clean, the
practice page renders + zoom-modal works (Playwright 9/9 at end of
Phase A; no UI changes since).
2026-04-25 16:38:00 -04:00
Vijay Janapa Reddi
542aaf95d2 cleanup(vault): release-ready Phase A — schema hardening + lint calibration + chain repair
Closes the cleanup arc (A.1–A.10 in RESUME_PLAN_RELEASE.md). Every
gate is now green: vault check --strict, vault lint, vault doctor,
vault codegen --check, staffml validate-vault, Playwright (9/9), tsc.

A.1 mobile-1962.svg: renamed `Edge` → `RegEdge` in graphviz source
    (`Edge` is a reserved keyword); SVG renders cleanly. Also fixed
    tinyml-1570.py (missing `import numpy as np`) which the new failure
    log surfaced.

A.2 render_visuals.py: structured per-ID failure log written to
    `_validation_results/render_failures.json` on every run; non-zero
    exit on any per-item crash; new `--fail-fast` and `--failure-log`
    CLI options. Replaces the prior silent-failure mode.

A.3 LinkML visual schema: typed as a structured sub-schema. New
    `VisualKind` enum (svg only — `mermaid` was reserved but never
    shipped, dropped to keep the enum honest). Path regex tightened
    to `^[a-z0-9-]+\.svg$`. Alt minimum length 10, caption required
    minimum length 5. TypeScript Visual interface + Question.visual
    field added to staffml-vault-types/index.ts.

A.4 Pydantic Visual + Question validators:
    - Visual.kind hard-rejects anything but `svg`
    - Visual.path enforces the new regex
    - Visual.alt min 10 chars, caption required min 5 chars
    - Question.model_validator: visual.path MUST resolve to a real
      file under interviews/vault/visuals/<track>/. Skipped in
      production deploys where the working tree is absent.

A.5 Registry repair + doctor split:
    - tools: repair_registry.py appended 5,269 missing IDs
      (the rename refactor at 8a5c3ff3c left the append-only registry
      unsynced; this brings disk-coverage to 100%). Header block in
      id-registry.yaml documents the rebuild rationale.
    - doctor.py: split symmetric `registry-integrity` check into
      `disk-coverage` (HARD FAIL if any disk YAML id is unregistered)
      and `registry-history` (INFO ONLY for retired ids — the registry
      is by design an audit log, retired ids are normal). Pre-existing
      `_check_schema_version` bug (`versions == {1}` vs string `"1.0"`)
      fixed.

A.6 Lint calibration via 4-expert consensus + bloom-canonical
    reclassification:
    - Spawned 4 experts (Vijay Reddi, Chip Huyen, Jeff Dean,
      education-reviewer) on 42 disputed (zone, level) pairs;
      consensus-builder aggregated to 15 valid / 19 invalid / 8
      borderline.
    - User arbitrated 8 borderlines: 7 widen / 1 reclassify.
    - Built ZONE_BLOOM_AFFINITY matrix (Education-Reviewer's idea):
      every zone admits its dominant Bloom verb + adjacent verbs,
      rejects clear hierarchy violations.
    - reclassify_zone_bloom_mismatch.py applied 576 deterministic
      zone fixes via BLOOM_CANONICAL_ZONE mapping (e.g. fluency+analyze
      → analyze, recall+analyze → analyze, evaluation+apply → implement).
    - Question.model_validator(_zone_bloom_compatible): hard-rejects
      future zone-bloom mismatches at write time. Generated drafts
      can no longer ship a self-contradicting classification.
    - ZONE_LEVEL_AFFINITY widened per consensus + arbitration +
      post-reclassification adjustments. Lint warnings: 1,308 → 0.

A.7 Chain integrity:
    - repair_chains.py: drops chain refs when a chain has <2 published
      members (chain ceases to exist), renumbers all members of any
      chain whose positions are non-sequential / duplicated /
      non-monotonic-by-level. Sort key: level ascending, then old
      position, then qid (deterministic).
    - validate-vault.py: relaxed sequential check to unique-positions
      check. Position gaps from mid-chain deletions are normal; what
      matters is uniqueness + bloom-monotonicity (vault check --strict
      enforces both from YAML source-of-truth).

A.8 Practice page visual + zoom modal:
    - QuestionVisual.tsx: wraps the `<img>` in `<Zoom>` from
      react-medium-image-zoom (4 KB). Click image → fullscreen
      `<dialog data-rmiz-modal>`; ESC closes. Added test-id
      `question-visual-img` for stable selector.
    - New Playwright test: 9th in the suite, deep-links cloud-4492,
      asserts the dialog opens on click and closes on ESC.
    - TypeScript: removed `mermaid` from local Visual types in
      corpus.ts and corpus-vault.ts; tsc clean.

A.9 All gates green:
    - vault check --strict: 0 errors / 0 invariant failures
    - vault lint: 0 errors / 0 warnings (was 1,308 warnings)
    - vault codegen --check: artifacts in sync (hash baseline updated)
    - vault doctor: 0 fails (registry-history info, git-state warn
      on uncommitted state-pre-this-commit)
    - staffml validate-vault: 0 errors / 0 warnings, deployment-ready
    - Playwright: 9/9 pass (was 8; +zoom modal test)
    - render_visuals: 0 errors (was 2 silent failures pre-A.2)
    - tsc: clean

Distribution after reclassification: 9,544 published unchanged;
576 items moved zone via bloom-canonical mapping (full per-item
report at /tmp/reclassify_changes.csv). Chain count 879 → 850
after orphan-singleton drops. release_hash updated.

Carry-forward to next session (Phase B):
- Priority gap closure for parallelism cells + global L4-L6+
  (the run that produced this corpus did not close the targeted
  cells; B.3 needs specialized prompts per cell-class)
- 120 NEEDS_FIX items from coverage_loop/20260425_150712/ still
  carry judge fix_suggestions; spawn fix-agent in Phase C
2026-04-25 15:12:51 -04:00
Vijay Janapa Reddi
ece6eccf23 feat(vault): massive build — 630 drafts generated, 320 PASS promoted, paper 0.1.1
Phase 1 (analyzer):  top-priority cells: tinyml/parallelism (0/90),
                     tinyml/networking (2/90), mobile/parallelism (0/127),
                     edge/parallelism (12/152), global/L4-L6+ deeply empty.
Phase 2 (loop):      6 iterations, 50 of 80 API calls used, 630 drafts
                     generated (52% PASS / 19% NEEDS_FIX / 26% DROP /
                     ~6% unjudged). Saturation reason: same top-priority
                     cell two iterations in a row — converged. Top-priority
                     decay 2.25 → 2.14 → 2.03 → 1.93 → 1.83 plateaued;
                     generator cannot meaningfully shrink
                     tinyml/specification/L6+ further within current
                     prompt framing. Both halt conditions (gap-threshold
                     0.8, max-calls 80) had headroom; structural
                     convergence fired first. Loop defaults bumped:
                     max-iters 20 → 30, max-calls 60 → 80, batch 12 → 30,
                     calls/iter 3 → 4, judge chunk 15 → 25.
Phase 3 (quality):   Spot-read 4 PASS items + visuals across cloud/edge/
                     mobile/tinyml. All technically sound, math correct,
                     real hardware grounding (MI300X, Jetson Orin,
                     Cortex-M4 BLE), SVGs follow svg-style.md palette.
                     Systemic finding: generator emitted 462 drafts with
                     malformed competency_area values (60 distinct
                     patterns: zones-as-area, bloom-verbs-as-area,
                     underscore hallucinations, dash-form/slash-form
                     concatenations). Resolved by extending
                     fix_competency_areas.py REMAP table; re-run cleanup
                     mapped all 462 to canonical. Root cause —
                     generator skips Pydantic validation at write time —
                     flagged for follow-on fix; not blocking.
Phase 4 (promote):   320 PASS items promoted; bundle 9,224 → 9,544
                     published (exactly +320). Visual assets: 234 in
                     bundle, mirrored to staffml/public/.
Phase 5 (paper):     Cut 0.1.1 release (patch bump: content addition,
                     no schema change). release_hash 0350da5706e6.
                     macros.tex regenerated to 9,544/87 topics/
                     13 areas/11 zones; 4 figures rebuilt; paper.tex
                     zone counts updated (1,583/1,227/1,113 →
                     1,615/1,256/1,144). PDF compiles to 25 pages,
                     no LaTeX errors (citation warnings pre-existing).
Phase 6 (GUI):       All 8 Playwright tests pass on fresh dev server.
                     /practice HTML contains zero malformed area names
                     (down from 60 distinct pre-fix).
Phase 7 (manifest):  vault-manifest.json refreshed: questionCount
                     9224 → 9544, contentHash 539eb877f9cc → 0350da5706e6,
                     track + level distributions updated to match
                     0.1.1 corpus.

Loop run dir: interviews/vault/_validation_results/coverage_loop/20260425_150712
Deferred queue (next session): 120 NEEDS_FIX items carrying judge
fix_suggestions + 165 DROP items, plus the generator validate-at-write fix.

The runbook (vault/docs/MASSIVE_BUILD_RUNBOOK.md) is the methodology
this session followed; can be re-run on any future generation day.
2026-04-25 13:15:41 -04:00
Vijay Janapa Reddi
24d3269c77 feat(vault): Phase 0 — competency_area cleanup + closed-enum hardening
Pre-flight cleanup before the day's massive question-generation build.

Three changes, all preventing recurrence of the Gemini-generated drift
that surfaced in the GUI's area filter:

1. fix_competency_areas.py — remap script with table covering 39
   observed malformed values (topic-name-as-area, zone-name-as-area,
   '<track> / <topic>' slash-form). Applied: 41 files fixed.

2. LinkML schema — added CompetencyArea closed enum with the 13
   canonical values (deployment, parallelism, networking, latency,
   memory, compute, data, power, precision, reliability, optimization,
   architecture, cross-cutting). competency_area field now references
   the enum. Future drafts that try to use a topic name fail validation.

3. Pydantic validator — _area() field_validator on Question rejects
   any value outside VALID_COMPETENCY_AREAS. Catches drift at YAML
   load before vault build can include the bad row.

Plus generator default batch_size bumped from 12 → 30 cells per Gemini
call. The 250-call/day cap rewards larger batches.

Plus MASSIVE_BUILD_RUNBOOK.md — the full day's methodology committed as
a runbook so future generation sessions follow the same shape.
2026-04-25 10:59:43 -04:00
Vijay Janapa Reddi
8a5c3ff3c5 refactor(vault): rename 4,754 cohort-tagged IDs to clean <track>-NNNN form
Audit followed by execution. Three findings, one big move, three minor
cleanups documented for follow-up.

Audit (interviews/vault/audit/2026-04-25-schema-folder-audit.md):
1. Folder structure is correct — flat <track>/<id>.yaml. ARCHITECTURE.md
   §3.3 documents that the v0.1 deeper-hierarchy attempt dropped 86
   questions and was reverted in v1.0 with sound reasoning. No change.
2. Schema is solid. Required fields populate at 100%; optional fields
   populate where they make sense. Three small fixes worth making
   later: tighter id regex, drop dead details.question, strip cohort
   tags at promotion.
3. The 86 questions dropped on April 18 were ALREADY restored on
   April 21 — set-difference of pre-v0.1 vs today's published returns
   zero. Nothing to recover.

Rename:
- 4,754 cohort-tagged YAMLs (cloud-fill-*, cloud-cell-*, cloud-r2-*,
  cloud-sus-*, cloud-crit-*, cloud-top-*, cloud-new-*, edge-exp-*,
  *-balance-*, *-portfolio-*, *-pilot-*, ...) renamed to clean
  <track>-NNNN form continuing each track's monotonic sequence.
- Per-track ranges minted:
    cloud:  cloud-2866..cloud-4486     (1,621 renamed)
    edge:   edge-0986..edge-2264       (1,279 renamed)
    mobile: mobile-0841..mobile-1870   (1,030 renamed)
    tinyml: tinyml-0830..tinyml-1541   (712 renamed)
    global: global-320..global-431     (112 renamed)
- Bundle rebuilt: 9,224 published (unchanged).
- vault check --strict: 0 load errors, 0 invariant failures.

Chain-breakage analysis (the original concern):
- ZERO of the 3,066 chain question references used cohort-tagged IDs.
  All chain refs were already in clean form. The rename has no chain
  impact at all — the breakage cost we discounted was zero.

External-link preservation:
- interviews/vault/docs/id-renames-2026-04-25.yaml records every
  old→new mapping for forensic lookup.
- interviews/staffml/src/data/id-redirects.json mirrors the map for
  the website.
- The practice page now consults this map when ?q=<id> resolves to
  nothing — preserves shareable links to the 4,428 published renames.
  (326 redirects target draft items and legitimately fall back to the
  not-found banner.)

Tests:
- All 7 existing Playwright smoke tests still pass.
- New test added: ?q=<legacy-cohort-id> resolves through the redirect
  map (using cloud-cell-10000 → cloud-2878 as the fixture).
- 8 / 8 pass.
2026-04-25 10:32:20 -04:00
Vijay Janapa Reddi
29081015d7 feat(vault): promote 25 PASS items to published — visual filter is alive
promote_validated.py reads aggregated LLM-as-judge PASS verdicts and flips
status:draft → status:published with canonical lifecycle stamps. Idempotent.

Promotions in this commit:
- 8 text questions (loop-iter-1 PASS): edge-0985, mobile-0833/0836/0840,
  tinyml-0817/0818/0819/0828
- 17 of 26 visual exemplars (judge pass rate 65%, drop rate 19%):
  cloud-2847 (queueing curve), cloud-2849 (incast topology),
  cloud-2850 (leaf-spine), cloud-2851 (bandwidth bars),
  cloud-2852 (checkpoint timeline), cloud-2854/2859/2860/2862,
  edge-0972 (Poisson vs bursty curves), edge-0975/76/77/79/80/82,
  tinyml-0816 (duty-cycle timeline)

Bundle is now 9,224 published (up from 9,199). 17 visual-block
questions in corpus.json. Static SVG mirror copied to
staffml/public/question-visuals/. Both manifests bumped.

Verified end-to-end via Playwright:
- /question-visuals/cloud/cloud-2847.svg → HTTP 200
- ?q=cloud-2847 surfaces "Operating Point on the Queueing Hockey-Stick"
  with the matplotlib-rendered queueing hockey-stick visible inline
- "Visual questions only" filter at L5/cloud now returns 4 questions (was 0)
2026-04-25 09:44:40 -04:00
Vijay Janapa Reddi
0afc384282 feat(vault): LLM-as-judge validator + iterative coverage loop
Two new pieces close the generation→validation→saturation feedback loop:

1. gemini_cli_llm_judge.py — multi-criteria validator. For each draft,
   judges math correctness, cell-fit (does it actually target the
   declared track/zone/level?), scenario realism, uniqueness vs canonical
   questions, and visual-asset alignment. Returns PASS/NEEDS_FIX/DROP
   per item. Batched (default 15 per call) for budget efficiency.

2. iterate_coverage_loop.py — drives the full loop:
   analyze → plan → generate → render → judge → apply → re-analyze.
   Self-paced: stops when (a) top priority gap drops below threshold,
   (b) DROP rate exceeds the saturation/hallucination threshold,
   (c) total API calls exceed budget, or (d) the same cell is top
   priority for two iterations in a row (convergence). The user no
   longer specifies "how many questions" — the loop generates until
   the corpus reaches a measurable steady state.

Plus 25 round-1 visual questions generated by the new batched generator
(5 batched calls × 5 cells each, zero failures).

The loop is the answer to "we need balance, not just volume": every
iteration's plan derives from a fresh analysis of where coverage is
weakest, so generation can never over-fill an already-saturated cell.
2026-04-25 09:18:32 -04:00
Vijay Janapa Reddi
d6c7fe5685 feat(vault): batched Gemini generator + coverage-gap analyzer
Two new scripts and a schema/renderer cleanup:

1. analyze_coverage_gaps.py: quantifies imbalance across track × zone ×
   level × competency-area, ranks weakest cells by priority weight, and
   emits both a Markdown report and a machine-readable JSON plan that
   the batched generator can consume. Critically, this surfaces gaps
   like tinyml/parallelism (15 vs ~100 expected), mobile/parallelism,
   global L4-L6+ (essentially empty), and the two missing visual
   archetypes (kv-cache-management, memory-hierarchy-design).

2. gemini_cli_generate_questions.py: refactored to BATCH cells per API
   call (default 12 cells/call, max 25 for visual). At 250 calls/day,
   this scales the generation budget from 250 q/day to 3,000 q/day
   while making auto-balanced selections across tracks × topics ×
   zones × levels via round-robin. Replaces the wasteful 1-q-per-call
   pattern.

3. render_visuals.py: source format is now inferred from filesystem
   (presence of <id>.dot or <id>.py next to <id>.svg) rather than from
   a YAML field. The Pydantic schema is unchanged, so generated YAMLs
   stay valid.

Plus the 9 visual question YAMLs are repaired: provenance set to
'llm-draft' (a valid enum value) and source_format dropped from the
visual block (Pydantic forbids extra fields).
2026-04-25 09:06:49 -04:00
Vijay Janapa Reddi
612885a952 refactor(vault): visual schema aligns with website + 5 more Gemini-generated visuals
Schema fix: visual.kind is always 'svg' (the format the website ships) and
visual.path points to that asset. The build-pipeline format is recorded as
optional metadata in visual.source_format ('dot' | 'matplotlib' | 'hand'),
which the website ignores. This separates "what users render" from "how
maintainers built it".

Source files live next to the SVG by naming convention; the renderer infers
the path from the YAML's source_format hint without a dedicated source field.

Five new visual exemplars generated by Gemini 3.1 Pro Preview, covering
diverse archetypes:
- cloud-2849 (DOT): incast-bottleneck topology
- cloud-2850 (DOT): leaf-spine fabric with 2:1 oversubscription
- cloud-2851 (matplotlib): bandwidth bar chart for data pipeline diagnosis
- cloud-2852 (matplotlib): checkpoint/recovery timeline with RPO/RTO
- edge-0972 (matplotlib): Poisson vs bursty queueing curves

Plus the four prior exemplars (cloud-2846, 2847, 2848, tinyml-0816)
re-emitted under the new schema. cloud-visual-001 unchanged — already had
the correct shape.

ARCHITECTURE.md rewritten to document the simpler three-layer separation
(website / build / authoring).
2026-04-25 08:57:26 -04:00
Vijay Janapa Reddi
f435185671 feat(vault): Gemini 3.1 Pro question generator with optional visual archetypes
gemini_cli_generate_questions.py mirrors gemini_cli_math_review.py's design:
review-first, JSON-strict, model pinned to gemini-3.1-pro-preview with a hard
guard against override. Targets weak coverage cells from the portfolio
balance loop or explicit --target track:topic:zone:level cells.

For visual-eligible topics (the 10 archetypes in audit_visual_questions.py),
the generator also produces the diagram source artifact (DOT or matplotlib
script) which render_visuals.py converts to a ship-ready SVG. This closes
the generation→render→validate loop using two different model passes:
Gemini drafts; the math review verifies.

First generated example: tinyml-0816 (wake-word duty-cycle evaluation) with
a matplotlib power-timeline visual. Math review returned CORRECT on the
first call. Status remains draft pending broader cross-validation.
2026-04-25 08:47:41 -04:00
Vijay Janapa Reddi
38e5c99f17 feat(vault): multi-format visual question architecture (DOT + matplotlib + SVG)
ARCHITECTURE.md establishes that visuals are a property of any question, not
a separate category. Three supported formats let the layout engine do the
work: DOT for graph topology, matplotlib for curves and Gantt charts, hand
SVG for custom layouts.

render_visuals.py is the single entry point that dispatches by visual.kind,
runs the appropriate tool, and normalizes the rendered SVG to the book's
font stack. It is idempotent and supports --dry-run.

Three exemplars cover the three formats:
- cloud-2846 (DOT): Tree AllReduce on 8 ranks — auto-laid-out topology
- cloud-2847 (matplotlib): Queueing hockey-stick curve with SLO line
- cloud-2848 (matplotlib): Pipeline-bubble Gantt for GPipe schedule

All three are status:draft pending math review and promotion in a later
batch. Existing cloud-visual-001 remains unchanged as the canonical
hand-SVG exemplar.
2026-04-25 08:42:59 -04:00
Vijay Janapa Reddi
e72b8bd832 feat(vault): add StaffML portfolio balance loop
Add a deterministic planner for weak cross-product coverage cells and seed the first portfolio iteration with validated global and TinyML draft questions.
2026-04-24 20:57:46 -04:00
Vijay Janapa Reddi
db1256c709 feat(vault): add StaffML applicability and visual audits 2026-04-24 20:22:00 -04:00
Vijay Janapa Reddi
357cfdcec6 feat(staffml): add visual filtering and Gemini math review loop 2026-04-24 19:59:57 -04:00
Vijay Janapa Reddi
165187fe99 fix(staffml): harden question and visual export path 2026-04-24 18:09:28 -04:00
Vijay Janapa Reddi
67b295ddf4 Merge StaffML question backfill into dev 2026-04-24 17:46:17 -04:00
Vijay Janapa Reddi
ad148c2f98 feat(vault): add StaffML gap planning audits 2026-04-24 17:09:53 -04:00
Vijay Janapa Reddi
d7a0e328d5 feat(vault): add question backfill tooling 2026-04-24 16:36:50 -04:00
Vijay Janapa Reddi
6387a3c627 feat(vault): add Gemini-based question field backfill script
Populates the newly-added optional `question` YAML field across all
9,657 corpus questions (71% are currently missing it — see
7cce759da for schema + render plumbing).

Design:
- Walks vault/questions/*/*.yaml; idempotent skip when the field is
  already present (re-running after a partial run resumes safely).
- Batches 40 questions per Gemini 3.1 Pro call; 8 thread-pool workers
  hit the API in parallel. Walltime target is ~30 minutes for the
  full corpus.
- Ships each candidate's scenario + realistic_solution (+ common
  mistake + napkin math as context) and asks for a single one-sentence
  interrogative per question, strict-JSON response.
- Writes back as a literal `question: "…"` line between the scenario
  and details blocks, preserving the rest of the YAML byte-for-byte
  (no yaml.dump round-trip that would reflow folded scalars). Sanity-
  re-parses the result before committing the write.
- Saves raw Gemini responses under `_validation_results/question_backfill_<TS>/`
  for audit + debugging; per-batch errors don't block other batches.

CLI:
    source ~/.zshrc_secrets   # exposes GEMINI_API_KEY
    python3 interviews/vault/scripts/gemini_backfill_question.py \
        --workers 8 --batch-size 40 \
        [--tracks edge,mobile]   # shard by track
        [--limit 50]             # dry run
2026-04-24 15:14:36 -04:00
Vijay Janapa Reddi
ed58b56cf4 docs(vault): archive obsolete scripts + post-mortem the v1.0 migration
Archives pre-v1.0 scripts under scripts/archive/ in both
interviews/vault/ and interviews/vault-cli/. ARCHITECTURE.md §3.3
rewritten with a post-mortem on why path-as-classification could not
represent the paper's full 11-zone × 6-level taxonomy. CHANGELOG.md
added documenting the full v1.0 migration.
2026-04-21 18:02:05 -04:00
Vijay Janapa Reddi
0726c734ea refactor(vault): migrate corpus to schema v1.0 with flat-by-track layout
Corpus split from the previous migration had three systematic defects:

1. Directory hierarchy {track}/{level}/{zone}/ could not represent the
   paper's full 4-axis taxonomy. 7 of 11 zones had no directory and were
   collapsed into recall/; L6+ had no directory and was collapsed into l1/.
   1,594 zone mis-placements and 943 level mis-placements resulted.

2. Singular YAML chain field could not represent multi-chain membership.
   101 questions had chain data truncated during split.

3. 86 published questions were silently dropped from the YAML export
   when their target (track, level, zone) directory did not exist.

Schema v1.0 makes every classification axis a YAML field (track, level,
zone, topic, competency_area, bloom_level, phase). Filesystem carries
only 'track' for navigability. Layout: questions/<track>/<id>.yaml.

Other changes:
- chains: [{id, position}] - plural, recovers 101 multi-chain memberships
- human_reviewed: {status, by, date, notes} - new field tracking human
  verification separately from LLM validation stamps
- bloom_level 'synthesize' normalised to 'create' (2001 Bloom revision)
- Dropped 'scope' (unused by GUI), 'mode' (25 questions, dead), 'version'
  (7,969 null, dead)
- codespell fixes across 14 files (unparsable, reuse, heterogeneous,
  slight, preempt, preemptible)

Before: 9,571 YAML files in 4-deep hierarchy, 4/11 zones representable.
After:  9,657 YAML files under 5 track dirs, all 11 zones + L6+ present.

corpus.json remains the migration input for this commit. Subsequent
commits invert source-of-truth so YAMLs become canonical and corpus.json
becomes a build artifact.

Migration script at interviews/vault/scripts/migrate_to_v1_0.py for
forensic reference.
2026-04-21 17:46:47 -04:00
Vijay Janapa Reddi
d9fcf8af23 refactor(vault): replace singular deep_dive with author-curated resources list
Shape change
============
Old: details.deep_dive: {title, url}  (singular, optional)
New: details.resources: [{name, url}] (multivalued, optional)

Rationale
=========
The singular deep_dive field paired with a 178-line hostname classifier
(interviews/staffml/src/lib/refs.ts) that labeled each link based on its
host. This model couples question content to a registry of "known hosts"
and forces every question to a single reference. The resources-list
model flips the responsibility: authors write a human-readable name
per reference, the UI renders a plain labeled link, and questions can
cite zero, one, or many references. It also dissolves the deferred
book-linking problem — when book URLs stabilize, authors add a book
entry to whichever questions benefit, with no schema, registry, or
classifier changes required.

Scope (this commit)
===================
- schema/question_schema.yaml: replace DeepDive class with Resource
  (name+url), change Details.deep_dive → Details.resources (multivalued)
- schema.py: add Resource pydantic model with https-only + name-length
  validators (XSS guard per REVIEWS.md H-6); replace flat
  deep_dive_title/deep_dive_url on QuestionDetails with resources list
- vault.py: update field-coverage metric + LLM prompt template
- scripts/generate_hard_questions.py: remove KA_URLS auto-fill
  (contradicted the author-curation principle), update prompt template
- scripts/generate_gaps.py: update prompt template + renderer to
  iterate resources list
- scripts/build_corpus.py: legacy markdown '📖 Deep Dive:' parser now
  appends to resources list instead of setting flat fields
- ARCHITECTURE.md: schema example, SQL DDL, validation rules
- REVIEWS.md: H-6 wording (deep_dive_url → resources[].url)
- corpus.json: scrub 9,495 stale deep_dive_title / deep_dive_url
  fields that pre-dated the vault YAML cleanup; add empty resources []
  default to all 9,657 questions for shape stability

What this does NOT change
=========================
- Zero question YAMLs are modified. Phase 0 audit confirmed 0 YAMLs
  have the deep_dive field populated (see audit script output in the
  preceding commit).
- schema_version stays at 1. EVOLUTION.md §2 classifies this as a
  breaking-major change that technically warrants schema_version: 2.
  However, no data or external consumer depends on the old shape —
  the field is uniformly absent in YAML — so the bump is ceremonial.
  Deferred until the first breaking change that requires a reader
  adapter.
- staffml/src/data/corpus.json (the shipped browser bundle) already
  has 0 deep_dive_url fields and 9,199 items; equivalence hash is
  unaffected because release_hash is computed from YAML inputs.
- No UI or consumer changes — deep-UI removal and refs.ts shrink
  follow as separate atomic commits.

Validation
==========
- All touched Python modules py_compile cleanly
- validate_corpus(corpus.json) against new schema.py: 9247/9657 pass;
  the 410 failures are pre-existing 'sustainability-carbon-accounting'
  topic taxonomy errors unrelated to this change
- Re-ran audit: still 0 deep_dive fields in YAMLs

Vault-Override: corpus-json-hand-edit: schema-migration artifact scrub removes stale deep_dive_* fields that predate the YAML cleanup and inserts empty resources [] defaults matching the new schema shape. YAML inputs unchanged; release_hash unaffected.
2026-04-16 18:22:08 -04:00
Vijay Janapa Reddi
04eddd2b35 chore(vault): add Phase 0 audit for deep_dive → resources migration
One-shot read-only script that walks every question YAML and reports:
- total questions, deep_dive coverage, hostname distribution
- book-host references (mlsysbook.ai, harvard-edge.github.io)
- orphans missing title (name-fallback candidates during migration)
- questions whose only ref is a book URL (would lose all refs)

Phase 0 finding from first run against 9657 question YAMLs:
- ZERO questions have the details.deep_dive field populated
- Confirms the corpus was already stripped of per-question references
  during an earlier vault migration; the refs.ts header comment about
  "4,000+ deep_dive_url values" reflects pre-migration state
- The UI conditional on current.details.deep_dive_url in practice/page.tsx
  currently renders for zero questions — it is dead code

Implication: the planned deep_dive → resources migration does not need
to touch any question YAMLs. The change reduces to (a) schema evolution,
(b) dead UI removal, (c) manifest + probe deletion. The audit script is
retained as a regression guard — if the field ever comes back it surfaces
in the next audit run.

Report output is gitignored via scripts/_*.json pattern.
2026-04-16 18:11:59 -04:00
Vijay Janapa Reddi
cbdb566381 feat(vault): Phase-1 migration contract fully closed in-repo
v2.3 \u2192 v2.4. ARCHITECTURE.md header + Appendix reflect the completed
migration.

WHAT CLOSED (\u00a711.1 contract):
  1. `vault build --legacy-json` regenerates the site's
     interviews/staffml/src/data/corpus.json from YAML. 9,199 published
     questions, site-compatible shape (chain_positions back to 0-indexed
     dict form, bloom_level derived from zone, competency_area aliased
     from topic, scope aliased from track). Deterministic via sort_keys +
     id-sort.
  2. Pre-commit hook INSTALLED via worktree-aware Makefile target
     (`make -C interviews/vault-cli hooks`). Symlink points at
     pre_commit_corpus_guard.py. Tested end-to-end: direct edit to
     vault/corpus.json triggers exit-1 with §11.1 reference.
  3. CI equivalence check added to .github/workflows/vault-ci.yml:
     regenerates corpus.json from YAML, diffs against committed. Fails
     PR on drift with actionable error message.
  4. Legacy generators demoted with DEPRECATED headers:
     - interviews/paper/scripts/analyze_corpus.py \u2192 vault export-paper
     - interviews/staffml/scripts/sync-vault.py \u2192 vault build --legacy-json
     - interviews/staffml/scripts/generate-manifest.py \u2192 vault publish
     - interviews/vault/scripts/export_to_staffml.py \u2192 vault build --legacy-json
  5. New DEPRECATED.md files at interviews/vault/scripts/ and
     interviews/staffml/scripts/ map every legacy script to its
     replacement. Both directories keep the old scripts for git-history
     legibility and archaeology; new contributors see the vault CLI first.
  6. ARCHITECTURE.md \u00a7Appendix rewritten as current-state table instead
     of aspirational "gone. replaced by..." entries.

NEW TESTS (interviews/vault-cli/tests/test_legacy_export.py \u2014 +4):
  - test_legacy_shape_matches_site_interface: every field corpus.ts
    declares is present in regenerated JSON.
  - test_chain_positions_legacy_shape: 1-indexed new schema \u2192
    0-indexed legacy dict form.
  - test_emitter_deterministic: byte-stable across reversed input order
    (required for CI diff-check).
  - test_competency_area_aliases_topic: legacy alias fields populated
    correctly.

FULL MATRIX GREEN:
  pytest:  38/38 passed in 0.19s (34 + 4 legacy-export)
  ruff:    All checks passed
  hook:    exit 0 on clean diff / exit 1 on corpus.json direct edit
  e2e:     vault build --legacy-json regenerates a bit-identical corpus.json
           vs the committed one; CI check wired to catch drift

WHAT'S LEFT (deploy-gated, \u00a720.5 #1, #5, #6 partial, #8, #9):
  - Production serves from D1: requires Phase-3 wrangler d1 create + deploy
  - Manual QA per CUTOVER_QA.md: requires live staging
  - Zero data loss D1-side verification: requires live D1
  - 48h monitoring: requires production traffic

These are intrinsically user-action; the YAML-side migration is done.
2026-04-16 14:57:24 -04:00
Vijay Janapa Reddi
482fe71375 feat(staffml): Gemini 3.1 Pro verification complete — 8,419 Qs verified
Full independent cross-model verification by gemini-3.1-pro-preview:
- 8,419/9,226 questions verified (91.2%)
- 7,376 CORRECT (87.6%), 697 ERROR (8.3%), 346 WARN (4.1%)
- 7 chunks had JSON parse failures (9,226 - 8,419 = 807 unverified)

Systematic fixes applied:
- MI300X 1300→1307 TFLOPS (127 occurrences)

All questions stamped with math_verified, math_status, math_issues,
math_model fields. Error list at scripts/_verification_results/.
19/19 invariants pass. Paper figures rebuilt.
2026-04-03 10:55:41 -04:00
Vijay Janapa Reddi
9955a76b92 feat(staffml): deep verification + mock NeurIPS reviews + paper improvements
Deep verification: 237-question stratified sample, 4.2% error rate found.
All 10 errors fixed (unit confusion, arithmetic, conceptual misapplication).
96 physics violations removed (impossible topic×track pairs).
Extended invariant checks added (applicability matrix enforcement).

Paper improvements from mock NeurIPS review feedback:
- Bloom critique softened ("complements" not "departs from")
- LLM generation transparency (95% ratio + 4.2% error rate disclosed)
- Scope explicitly limited to technical systems reasoning
- H100 specs corrected (989 TFLOPS, not 495)
- Track percentages reference table instead of hardcoding
- Figure captions use macros for consistency

New topics with questions: software-portability (50), comm-compute-overlap (50).
Phase metadata reclassified (42.5% inference, 37.7% both, 19.9% training).
2026-04-02 07:28:41 -04:00
Vijay Janapa Reddi
098f872821 feat(staffml): 8,891 Qs + backward design + math verification + A100 fix
Corpus: 8,891 published (87.8% validated). Backward design methodology.
A100 constants fixed (FP16: 156→312 TFLOPS). Math verification done.
New figures: backward design chain, applicability matrix. Bibliography
updated (Wiggins, Messick). Verification script added.
2026-04-01 23:53:38 -04:00
Vijay Janapa Reddi
42dc31a202 feat(tutorial): 6 publication-quality SVG figures for ISCA slides
- roofline-model.svg: Classic Roofline with LLM decode + CNN training points
- iron-law-decomposition.svg: Iron Law equation with wall-to-term mapping
- serving-two-phases.svg: Prefill (compute) vs Decode (memory) phases
- allreduce-ring.svg: 8-GPU ring with reduce-scatter + all-gather
- hardware-spectrum.svg: nRF52840 → ESP32 → Jetson → H100 → NVL72 scale
- carbon-geography.svg: Norway/Quebec/US/Poland bar chart (41x gap)

All follow svg-style.md: 900x500 viewBox, semantic colors, Helvetica font.
2026-04-01 19:22:20 -04:00
Vijay Janapa Reddi
481f72feac feat(staffml): expand corpus to 7,533 published questions (86% validated)
Generated 1,125 questions via gemini-2.5-flash batch generation across
1,762 gap-filling jobs, plus 235 targeted questions via Claude for thin
topics. Cleaned 252 ERROR questions, fixed duplicate IDs and broken chain
references. All 79 topics >= 25 questions, all 11 zones >= 250 questions,
19/19 invariant checks pass. Paper figures rebuilt with updated stats.
2026-04-01 16:03:23 -04:00
Vijay Janapa Reddi
71af7326ac chore: remove internal plans from repo 2026-04-01 13:44:01 -04:00
Vijay Janapa Reddi
f28e770547 feat(staffml): validate corpus to 90% + remove 333 errors
Validation:
- Ran validate_questions.py on ALL 7,204 questions via gemini-2.5-flash
- Result: 5,766 validated (90%), 626 warnings, 333 errors removed
- Final corpus: 6,395 published, 90% validation rate

Added EXPANSION_PLAN.md with detailed balancing + paper polish plan.
2026-04-01 13:18:17 -04:00
Vijay Janapa Reddi
212deffb3a feat(staffml): expand edge/mobile/tinyml tracks with 299 platform-diverse questions
Generated 299 new questions across three tracks using gemini-3.1-pro-preview
with platform cycling for vendor diversity:

Edge (99 questions):
  Platforms: Jetson Orin (29), Hailo-8 (55), Coral Edge TPU (34), Qualcomm (25)
  Zones: analyze (48), design (32), diagnosis (19)

Mobile (100 questions):
  Platforms: Apple A17 (26), Snapdragon 8 Gen 3 (23), Tensor G3 (36), Exynos (28)
  Zones: analyze (43), design (32), diagnosis (24)

TinyML (100 questions):
  Platforms: Cortex-M4 (23), ESP32-S3 (34), Cortex-M7+Ethos-U55 (44), nRF5340 (31)
  Zones: analyze (31), design (21), evaluation (17), diagnosis (16)

Track distribution improved:
  cloud: 49.6% (was 52.4%)
  edge: 16.5% (was 15.6%)
  mobile: 14.6% (was 13.5%)
  tinyml: 13.6% (was 12.5%)

19/19 invariant checks pass. Figures regenerated.
2026-04-01 10:29:51 -04:00
Vijay Janapa Reddi
e8544e712d feat(staffml): add track expansion script with platform diversity
New script expand_tracks.py generates questions targeting edge, mobile,
and TinyML track gaps using Opus 4.6 with platform cycling:

Platforms per track:
- Edge: Jetson Orin, Hailo-8, Coral Edge TPU, Qualcomm Cloud AI 100
- Mobile: Apple A17 Pro, Snapdragon 8 Gen 3, Tensor G3, Exynos 2400
- TinyML: Cortex-M4, ESP32-S3, Cortex-M7+Ethos-U55, nRF5340

Taxonomy updates:
- graph-compilation: add tinyml track (TFLM AOT compilation)
- federated-learning: add edge track
- safety-certification: add mobile track
- streaming-ingestion: add mobile track
2026-04-01 09:46:13 -04:00
Vijay Janapa Reddi
540abedeb0 feat(staffml): generate 75 questions to fill ikigai zone gaps
Generate targeted questions using gemini-3.1-pro-preview to fill
the worst coverage holes in the ikigai competency model:

Zone improvements:
- analyze:  6 → 34 (+28, was nearly empty)
- implement: 234 → 246 (+12)
- design: 358 → 368 (+10)
- evaluation: 968 → 976 (+8)
- diagnosis: 1202 → 1208 (+6)
- fluency: 872 → 878 (+6)
- mastery: 79 → 84 (+5)

Pipeline: 200 candidates → 94 parsed → validated with
gemini-3.1-pro-preview → 15 errors + 4 truncated removed →
75 clean questions. 19/19 invariant checks pass.
2026-03-31 03:11:12 -04:00
Vijay Janapa Reddi
4bed820624 feat(staffml): add parallel zone gap filler for ikigai coverage
Add fill_zone_gaps.py: generates questions targeting specific
topic×zone holes using gemini CLI in parallel.

Key design choices:
- 1 question per API call (maximizes quality)
- Explicit topic+zone+track+level targeting per call
- Hardware reference constants injected in every prompt
- Zone-specific prompt instructions (e.g., "analyze" = pure
  tradeoff reasoning, "realization" = design + sizing)
- Auto mode fills worst gaps first across all zones
- Configurable parallelism (--workers) and budget (--budget)

Usage:
  python3 fill_zone_gaps.py --auto --budget 200 --workers 10
  python3 fill_zone_gaps.py --zone analyze --budget 50
2026-03-31 00:46:14 -04:00
Vijay Janapa Reddi
f148572ecf feat(staffml): add 19-check invariant system and pipeline guardrails
Add vault_invariants.py with 19 structural checks that validate
cross-file consistency between corpus, taxonomy, and chains:
- Checks 1-14: original structural checks (duplicates, kebab IDs,
  question counts, prerequisite integrity, cycles, canonical values)
- Checks 15-19: gold standard checks (zone coverage, topic coverage,
  topic concentration, chain levels, validation consistency)

Add gate.py for pipeline integration — any script can wrap its work
in an InvariantGate context manager to block on regressions.

Update WORKFLOW.md to include invariant gate step before commits.
2026-03-30 23:26:26 -04:00
Vijay Janapa Reddi
26e0ab3856 restructure interviews/ with vault separation and per-directory licenses
- Move corpus, taxonomy, chains, scripts into interviews/vault/
- Rename interviews/staffml/ (was interviews/staffml/) as the branded app
- Add CC BY-NC-SA 4.0 LICENSE to: book, kits, labs, slides, instructors, interviews
- Add AGPL-3.0 LICENSE to interviews/staffml/ (the app)
- Add vault LICENSE for pipeline scripts
- Update all GitHub Actions workflows for new paths
- Update README links and vault.yaml export paths
- Fix regex patterns in site/book deploy workflows

License structure:
  interviews/LICENSE      — CC BY-NC-SA 4.0 (corpus + data)
  interviews/staffml/LICENSE — AGPL-3.0 (app code)
  interviews/vault/LICENSE   — pipeline copyright
  book|kits|labs|slides|instructors/LICENSE — CC BY-NC-SA 4.0
  tinytorch/LICENSE       — Apache 2.0 (unchanged)
2026-03-25 15:18:14 -04:00