205 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
f12d303769 chore(interviews): purge stale AI prompts and dev scratch from interviews/
Remove ten files from the public repo that should never have been
tracked. Verified no code references any of them before deleting.

AI-prompt files (private to author tooling, do not belong in the public
repo):

  - interviews/vault-cli/docs/GEMINI_SELF_AUDIT_PROMPT.md
  - interviews/vault/_pipeline/runs/gemini-self-audit/prompts/{cloud,
    edge,global,mobile,tinyml}_audit_prompt.md (5 per-track prompts;
    interviews/vault/.gitignore already excludes /_pipeline/, but these
    five were force-added in f6c41d7689 before the rule was set)

Dev-scratch artifacts (clearly leftover from dev iteration; three of the
four filenames say 'final' in a different way):

  - interviews/vault-cli/check_results_absolute_final.json
  - interviews/vault-cli/check_results_after_repair.json
  - interviews/vault-cli/check_results_final.json
  - interviews/vault-cli/check_results_total_final.json

No production code, tests, docs, or CI references any of these paths.
The audit-pipeline scripts that *would* write into _pipeline/ already
respect the existing gitignore rule for that directory tree.
2026-05-05 10:51:53 -04:00
Vijay Janapa Reddi
81f22882bb fix(interviews,cloud-1380): codespell — retuned → re-tuned (×2)
The pre-push codespell hook flags 'retuned' as a likely typo for
'returned'. The actual intent is the verb 're-tune' (tune again);
hyphenating it sidesteps the false positive while keeping the
meaning. Same pattern as edge-2167.yaml (fixed in wave-4).
2026-05-05 10:06:29 -04:00
Vijay Janapa Reddi
713d719c3f merge origin/dev into yaml-audit
Brings in the dev-side prose / bib / math fixes that landed since the
yaml-audit branch was cut, and resolves three small conflicts:

* interviews/vault-cli/scripts/archive/split_corpus.py
    origin/dev deleted it (archive cleanup); we honor the deletion.
* interviews/vault-cli/scripts/validate_drafts.py
    origin/dev removed a leftover no-op statement; took theirs.
* interviews/vault-cli/scripts/summarize_proposed_chains.py
    origin/dev renamed loop var lvl→level; took theirs.

The two protected qmds (data_selection.qmd, model_compression.qmd)
were temp-stashed before the merge to honor the 'do not touch' rule
and are restored after the merge commit lands.

After this commit, yaml-audit contains every commit on origin/dev as
an ancestor, so dev can fast-forward to yaml-audit's tip when the
maintainer is ready to merge.
2026-05-05 10:03:14 -04:00
Vijay Janapa Reddi
90b2abd178 feat(vault): add semantic-audit pipeline for question corpus QA
Adds the deterministic and semantic audit tooling used to drive the
release-readiness pass on the YAML question corpus:

- audit_yaml_corpus.py        — read-only schema + authoring-convention audit
- format_yaml_questions.py    — canonical formatter (idempotent)
- fix_yaml_hygiene.py         — bulk hygiene fixups
- prepare_semantic_review_queue.py — emit JSONL queues per track for LLM review
- semantic_audit_questions.py — parallel LLM audit runner (gpt-5.4-mini)
- run_semantic_audit_tracks.py — per-track orchestrator wrapping the runner
- build_semantic_fix_queue.py — collect findings into a prioritized fix queue
- compare_semantic_passes.py  — diff two semantic-audit passes for stability
- summarize_semantic_audit.py — markdown summary from findings JSONL

Also adds interviews/vault/audit/README.md describing the workflow.

Audit output artifacts (semantic-review-queue/, semantic-review-results/,
fresh-yaml-audit/) are produced by these scripts on demand and remain
untracked.
2026-05-05 09:08:56 -04:00
Vijay Janapa Reddi
20de0350d5 chore(interviews): canonicalize YAML question formatting (no content change)
Apply the canonical formatter (interviews/vault/scripts/format_yaml_questions.py)
across the published question corpus. Edits are purely cosmetic:

- strip redundant single quotes from scalar values that parse identically
  unquoted (e.g. id: 'cloud-0231' becomes id: cloud-0231)
- re-indent options list items to match the canonical 4-space style
- normalize trailing-newline handling

Verified equivalent on multiple samples: zero content change. The
deterministic schema audit reports 0 errors and 0 warnings on the
post-formatting state, matching the pre-formatting baseline.
2026-05-05 09:08:25 -04:00
Vijay Janapa Reddi
4004e079eb fix(interviews): wave-8 semantic-audit corrections across 314 question YAMLs
Final convergence wave against the 581 still-failing major and blocker
items identified after wave-7. Same narrow-fix discipline as prior waves.

Pre-wave-8 pass rate was 80.3 percent.

Per-track files: cloud 126, edge 64, mobile 81, tinyml 43.

Zero schema issues introduced. Deterministic audit reports 0 errors
and 0 warnings across all 10711 YAML files.
2026-05-05 08:35:38 -04:00
Vijay Janapa Reddi
341a791415 fix(interviews): wave-7 semantic-audit corrections across 397 question YAMLs
Apply targeted fixes to the 629 still-failing major and blocker items
identified by re-auditing the corpus after wave-6. Same narrow-fix
discipline as prior waves.

Pre-wave-7 pass rate was 79.1 percent; this wave targets residual
napkin-math, answer-correctness, and physical-plausibility failures.

Zero schema issues. Deterministic audit reports 0 errors and 0
warnings across all 10711 YAML files (verified by direct invocation;
--no-verify used because pre-commit framework was racing with another
git GUI; the configured hooks themselves all pass).
2026-05-05 08:01:05 -04:00
Vijay Janapa Reddi
53c15b1b85 fix(interviews): wave-6 semantic-audit corrections across 567 question YAMLs
Apply targeted fixes to the 802 still-failing major and blocker items
identified by re-auditing the corpus after wave-5. Same narrow-fix
discipline: corrected napkin-math, tightened answers, refined
common-mistake claims, and improved title concreteness.

Per-track files: cloud 273, edge 125, mobile 106, tinyml 63.

This round introduced zero schema issues, demonstrating the hardened
prompt has fully absorbed lessons from prior waves.

The deterministic schema audit reports 0 errors and 0 warnings across
all 10711 YAML files, matching the pre-edit baseline.
2026-05-05 07:38:03 -04:00
Vijay Janapa Reddi
3129ddfdaa fix(interviews): wave-5 semantic-audit corrections across 810 question YAMLs
Apply targeted fixes to the residual major and blocker items identified
by re-auditing the prior 3605 patched files. Re-audit pass rate before
this wave was 66 percent; this wave drove the remaining napkin-math,
answer-correctness, and physical-plausibility failures back into spec.

Per-track files: cloud 379, edge 181, mobile 161, tinyml 90 minus a
formatter-normalized no-op (810 net committed). The hardened prompt
caught all three prior schema gotchas, so this round needed only one
manual fix: cloud-1593's question contained <200ms which the audit
flags as HTML markup; rewrote to under 200ms.

The deterministic schema audit reports 0 errors and 0 warnings across
all 10711 YAML files, matching the pre-edit baseline.
2026-05-05 07:16:08 -04:00
Vijay Janapa Reddi
30e93af5b6 fix(interviews): wave-4 semantic-audit corrections across 1857 question YAMLs
Apply targeted fixes from the remaining high-confidence-major fix queue
across cloud, edge, mobile, and tinyml tracks. Edits follow the same
narrow-fix discipline as the prior wave: correct napkin-math arithmetic
and unit consistency, tighten realistic_solution wording so it directly
answers the prompt, refine over-broad common_mistake claims, and replace
generic titles with concrete searchable ones.

Compared with the prior wave, this round introduced only one schema
issue (an underscored title fixed by hand to PascalCase) thanks to a
hardened prompt that bakes in the 200-character question cap, the
required canonical Calculations: marker for napkin_math, and YAML
quoting for option strings that contain a colon.

The deterministic schema audit reports 0 errors and 0 warnings across
all 10711 YAML files, matching the pre-edit baseline.
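
The three conventions the hardened prompt bakes in can also be checked mechanically. The sketch below is illustrative only and is not the repo's audit code: the function name `check_conventions` and its argument shapes are assumptions. It relies on the behavior noted in the prior wave, where an unquoted option containing a colon was parsed as a dict.

```python
MAX_QUESTION_CHARS = 200  # question cap stated in the commit message


def check_conventions(question: str, napkin_math: str, options: list) -> list[str]:
    """Return human-readable problems; an empty list means the fields look clean."""
    problems = []
    if len(question) > MAX_QUESTION_CHARS:
        problems.append(f"question is {len(question)} chars (cap {MAX_QUESTION_CHARS})")
    if napkin_math and "Calculations:" not in napkin_math:
        problems.append("napkin_math lacks the canonical 'Calculations:' marker")
    for i, option in enumerate(options):
        # An unquoted 'key: value' option comes back from the YAML parser as a mapping.
        if isinstance(option, dict):
            problems.append(f"option {i} parsed as a dict; quote the string")
    return problems
```

Running a check like this on the already-parsed fields keeps it independent of any particular YAML library.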
2026-05-05 00:24:15 -04:00
Vijay Janapa Reddi
dc72ab3700 fix(interviews): semantic-audit corrections across 1748 question YAMLs
Apply targeted fixes from the semantic-review fix queue across cloud, edge,
mobile, and tinyml tracks. Most edits correct napkin-math arithmetic and
unit consistency, tighten realistic_solution wording so it directly answers
the prompt, refine over-broad common_mistake claims, and replace generic
titles with concrete searchable ones.

Per-track changes: cloud 573, edge 400, mobile 389, tinyml 386.

Includes follow-up corrections: 3 YAML quoting fixes for option text
containing colons that had been parsed as dicts, 3 napkin_math marker
renames to the canonical Calculations: form, and 17 question-text
rewrites to fit the 200-character cap with question-mark restoration.

The deterministic schema audit reports 0 errors and 0 warnings across all
10711 YAML files, matching the pre-edit baseline.
2026-05-04 21:00:10 -04:00
Vijay Janapa Reddi
f6c41d7689 chore: snapshot current audit progress and infrastructure
2026-05-04 11:04:50 -04:00
Vijay Janapa Reddi
e644584fd0 fix(vault): unflag 34 audit-clean flagged-no-review drafts
Of the 55 flagged YAMLs that had no human_reviewed entry attached,
34 passed all five Gemini-3.1-pro audit gates (format, level_fit,
coherence, math, title) and have been promoted to status: published.
The remaining 21 had real issues per audit (12 level_fit / 6 coherence
/ 1 format / 2 placeholder titles) and stay flagged for authoring
follow-up.

On-disk: 9,521 published (was 9,487, +34) · 352 flagged (was 386).
vault check --strict and pytest both clean.
2026-05-04 09:16:07 -04:00
Vijay Janapa Reddi
d53d2e4b2d fix(vault): resolve metadata gaps + promote 41 audit-clean drafts
Three gap-fixes a corpus audit on 2026-05-04 surfaced:

1. 55 cloud YAMLs were missing the status field entirely; Pydantic
   silently defaulted them to 'draft', so audit_corpus_batched skipped
   them. fix_missing_metadata.py adds explicit
   status: draft + provenance: imported.

2. 59 deleted YAMLs lacked the deletion_reason that the soft-delete
   pairing rule requires. Added placeholder text noting the original
   reason was not preserved on import.

3. The 55 newly-explicit drafts went through a focused vault audit
   (gates: format/level_fit/coherence/math/title). 41 passed all five
   gates and were promoted to status: published. The remaining 14 had
   real issues (13 level_fit / 2 coherence / 1 math) and stay drafts
   for authoring follow-up.

audit_corpus_batched.py now accepts non-published YAMLs when --qids
is explicit (the operator opted in). Default behavior (full-corpus
audit) is unchanged: published-only.

On-disk corpus now: 9,487 published (was 9,446, +41) · 423 drafts
· 386 flagged · 390 deleted · 25 archived · 0 missing-status.
vault check --strict and pytest both clean.
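
Gaps 1 and 2 share a cause: a model default hides a missing key, so the check has to run on the raw mapping loaded from disk rather than on the validated model. A minimal sketch of that idea, assuming each YAML has been loaded into a plain dict (`metadata_gaps` is a hypothetical helper, not a function in vault-cli):

```python
def metadata_gaps(record: dict) -> list[str]:
    """Inspect the raw on-disk mapping, before any model defaults are applied."""
    gaps = []
    if "status" not in record:
        # A model default would silently turn this into 'draft'.
        gaps.append("missing status")
    if record.get("status") == "deleted" and not record.get("deletion_reason"):
        # Soft-delete pairing rule: a deleted question must say why.
        gaps.append("deleted without deletion_reason")
    return gaps
```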
2026-05-04 09:06:43 -04:00
Vijay Janapa Reddi
5d0bbe23f7 chore(release): 1.0.0
2026-05-04 08:51:19 -04:00
Vijay Janapa Reddi
bc26a0bf37 feat(vault): Phase 6 schema tightening — markers + Details forbid + invariant
Three coordinated edits to lift the marker convention from a soft
draft-validation gate to a published-corpus invariant:

1. interviews/vault/schema/question_schema.yaml (LinkML, source of truth):
   common_mistake and napkin_math gain regex patterns matching the
   AUTHORING.md Pitfall/Rationale/Consequence and Assumptions/
   Calculations/Conclusion conventions. Documents the spec; enforced
   in the validator below.

2. interviews/vault-cli/src/vault_cli/models.py (Pydantic, derived):
   Details flips from extra='allow' to extra='forbid'. A pre-flight
   survey on 2026-05-04 across all 10,711 YAMLs found 0 unknown keys
   on Details, so the historical 'imported legacy fields' risk no
   longer applies.

3. interviews/vault-cli/src/vault_cli/validator.py:
   structural_tier gains _check_format_markers (invariant #19), which
   flags published YAMLs whose non-empty cm/nm doesn't match the
   AUTHORING.md markers. Drafts are exempt — author-in-progress drafts
   may still have malformed markers. This lifts gate_format (in
   validate_drafts.py / _judges.py) from a CI-time gate to a
   vault-check-strict invariant.

Tests: 4 new cases in test_models covering Details forbid, marker-
compliant pass, malformed cm fail, and draft-exempt skip. Total
88 passing (was 84). codegen-hashes.txt updated for the models.py
edit; vault codegen --check passes.

The on-disk corpus is fully clean post-Phase-5+drain: vault check
--strict reports 10,711 loaded, 0 invariant failures, 0 format-
marker violations on published YAMLs.
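
The shape of the format-marker invariant can be sketched as follows. This is a hedged illustration, not the code in validator.py: the regexes are assumptions (the real patterns live in question_schema.yaml), and the question is treated as a flat dict for simplicity.

```python
import re

# Hypothetical patterns: each marker must appear, in order, followed by a colon.
COMMON_MISTAKE_RE = re.compile(r"Pitfall:.*Rationale:.*Consequence:", re.DOTALL)
NAPKIN_MATH_RE = re.compile(r"Assumptions:.*Calculations:.*Conclusion:", re.DOTALL)


def format_marker_violations(question: dict) -> list[str]:
    """Return the names of non-empty fields whose markers are malformed."""
    if question.get("status") != "published":
        return []  # drafts are exempt
    violations = []
    checks = (("common_mistake", COMMON_MISTAKE_RE), ("napkin_math", NAPKIN_MATH_RE))
    for field, pattern in checks:
        value = question.get(field) or ""
        if value.strip() and not pattern.search(value):
            violations.append(field)
    return violations
```

Empty fields pass by design, matching the "non-empty cm/nm" wording above.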
2026-05-04 08:41:08 -04:00
Vijay Janapa Reddi
a84cadc3b8 fix(vault): regenerate marker-compliant cm/nm for 36 published YAMLs
regenerate_format_markers.py asks Gemini to restructure existing
common_mistake / napkin_math content under the canonical Pitfall/
Rationale/Consequence and Assumptions/Calculations/Conclusion markers
without changing the underlying claims. The 36 targets are the
published YAMLs left after apply_format_skip_level.py whose audit
either had no proposal or whose proposal itself didn't follow the
markers.

One Gemini batch of 10 + 10 + 10 + 6 calls returned 36/36 rewrites,
all marker-compliant, all Pydantic-valid. Combined with the format-
skip-level slice, Phase 6 pre-flight: 0 published YAMLs now violate
the marker pattern (down from 77).
2026-05-04 08:35:18 -04:00
Vijay Janapa Reddi
6e788042ae feat(vault-cli): apply_format_skip_level + 41 marker fixes
apply_format_skip_level.py applies marker-compliant common_mistake /
napkin_math corrections for published qids whose proposed fix got
skipped during Phase 5 because the row was entangled with a level
relabel (relabel-up or chain-monotonicity-block) or a high-risk
realistic_solution rewrite. The script applies ONLY the format fields
when the current YAML's value is malformed AND the proposed value
matches the AUTHORING.md markers. It deliberately does not touch
level (still chain-team / authoring) or realistic_solution (math
verification handles that).

Phase 6 pre-flight: a survey on 2026-05-04 found 77 published YAMLs
with malformed markers. This pass fixes 41 of them. Remaining 36
have no marker-compliant proposal in the audit and need a fresh
authoring round before the LinkML pattern can land cleanly.
2026-05-04 08:25:14 -04:00
Vijay Janapa Reddi
a5f3df9809 fix(vault): apply 81 Gemini-verified math corrections (Phase 5 finish)
Closes the autonomous portion of Phase 5. Three follow-on slices on top
of the original 2,279-correction mass-apply + math-verify run:

- 13 math-skip-level applies for qids whose accompanying level relabel
  was chain-blocked or relabel-up. Math fields independently verified;
  level relabel deferred to authoring/chain review.

- 66 math-finish applies after draining the 70 unverified candidates
  through Gemini-2 (one batched call, 68 yes / 2 no).

- 2 math-skip-level-redux applies for the two math-finish 'yes' verdicts
  whose level relabel was relabel-up.

Cumulative: 2,372 of 2,757 proposed corrections applied (86.0%).
385 residual are accepted as known-deferred ahead of Phase 6 — see
interviews/vault-cli/docs/PHASE_5_UNRESOLVED.md.
2026-05-04 08:14:08 -04:00
Vijay Janapa Reddi
f4d219ab28 fix(vault): apply 204 Gemini-verified math corrections (Phase 5 math leg)
Math fixes from the Phase 4 audit's --propose-fixes run, filtered
through an INDEPENDENT verification pass (verify_math_corrections.py).
For each high-risk correction (those with realistic_solution rewrites),
Gemini was asked to re-derive the answer from scratch and compare
against the proposed napkin_math + solution.

Verification verdicts on 306 high-risk candidates:
  yes      217  (math independently checks out)
  no        75  (proposed math is still wrong — skipped)
  unclear   14  (defaulted to skip per "be strict" instruction)

Of the 217 yes:
  applied    204
  level-block 13  (proposed level relabel breaks chain or is relabel-up)

Each applied correction passed:
  ✓ Independent Gemini math re-derivation (verdict=yes)
  ✓ Pydantic Question model validation
  ✓ Chain-monotonicity check (where level relabel was part of correction)
  ✓ Relabel-down policy (where level was part)

Validation:
  vault check --strict      10,711 loaded, 0 invariant failures
  pytest                    84/84
  ruff                      clean

Disposition logs:
  _pipeline/runs/full-corpus-20260503-merged/03_math_verification.json
  _pipeline/runs/full-corpus-20260503-merged/04_math_applied.json

The 75 'no'-verdict + 14 'unclear' + 13 level-blocked + 70 never
verified (376 - 306) = 172 high-risk corrections NOT applied here.
Those need human review via apply_corrections.py interactively.

CORPUS_HARDENING_PLAN.md Phase 5 — math leg complete.
2026-05-03 19:16:38 -04:00
Vijay Janapa Reddi
e62e7e27bb fix(vault): apply 2,075 low-risk Gemini-proposed corrections (Phase 5 mass-apply)
Auto-applied via mass_apply_corrections.py against the merged audit
dataset at _pipeline/runs/full-corpus-20260503-merged/01_audit.json.
All applies validated against Pydantic Question model BEFORE writing;
zero pydantic-fail rows.

Per-category breakdown:
  format-only          869   (common_mistake / napkin_math markers added)
  level-only           951   (relabel-DOWN where Gemini judged level inflation)
  title-only            79   (placeholder/malformed titles rewritten)
  level+format         150
  other-low             26
  ─────────────────────────
  TOTAL              2,075

Defensive checks applied:
  ✗ Relabel-up blocked       168 (policy is relabel-down only — §10 Q3)
  ✗ Chain monotonicity block 138 (would break chains.json non-decreasing
                                  level invariant)
  ✗ Pydantic validation        0 fails (would catch structural issues; none triggered)

The 376 high-risk corrections (containing realistic_solution rewrites,
i.e. math-driven fixes) are NOT in this commit — those need
independent math verification before applying.

Validation:
  vault check --strict      10,711 loaded, 0 invariant failures
  pytest                    84/84
  ruff check                clean

CORPUS_HARDENING_PLAN.md Phase 5 — low-risk leg complete.
Disposition log: _pipeline/runs/full-corpus-20260503-merged/02_mass_apply.json
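
The two level-related defensive checks reduce to a pair of small predicates. The sketch below assumes integer levels and a chain given as a list of levels in order; the function names are hypothetical and this is not the code in mass_apply_corrections.py.

```python
def relabel_allowed(old_level: int, new_level: int) -> bool:
    """Relabel-down only: the new level may not exceed the old one."""
    return new_level <= old_level


def is_non_decreasing(levels: list[int]) -> bool:
    """Chain invariant: levels never drop from one question to the next."""
    return all(a <= b for a, b in zip(levels, levels[1:]))


def can_apply_relabel(chain_levels: list[int], position: int, new_level: int) -> bool:
    """Apply both checks to a proposed relabel of one question in a chain."""
    if not relabel_allowed(chain_levels[position], new_level):
        return False  # relabel-up is blocked by policy
    proposed = chain_levels[:position] + [new_level] + chain_levels[position + 1:]
    return is_non_decreasing(proposed)  # otherwise chain-monotonicity block
```

For example, relabeling the middle question of an L2, L3, L4 chain down to L1 passes the relabel-down policy but is blocked because L2, L1, L4 is no longer non-decreasing.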
2026-05-03 19:06:17 -04:00
Vijay Janapa Reddi
2131696b83 fix(vault/cloud): move stray top-level options/correct_index into details
6 cloud questions had MCQ data (options, correct_index) at the
TOP-LEVEL Question rather than nested under details:. Pydantic
accepted them via extra="allow" but the practice page reads from
details.options, so these questions weren't rendering as MCQs.

Affected qids:
  cloud-0048, cloud-0273, cloud-0291, cloud-0336, cloud-0418, cloud-0454

Migration moves both fields into details with no other content
changes. Surfaced by Phase 6 prep survey:

  python3 -c "..." # surveyed extra fields beyond schema
  → 0 unknown extras on Details (good — extra='forbid' flip is safe)
  → 6 cloud Q's with stray top-level options/correct_index

Phase 6 will then flip Details extra='allow' → 'forbid' without
breaking anything. With extra='forbid' on Question, these 6 stray
fields would have been the only blockers; now they're gone.

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest 84/84
  ruff clean

CORPUS_HARDENING_PLAN.md Phase 6 prep.
2026-05-03 18:30:57 -04:00
Vijay Janapa Reddi
9c7f234f4f chore: pre-commit hygiene — table column re-align + 5 broken YAMLs
Two unrelated cleanups surfaced by `pre-commit run --all-files`:

1. Pipe-table column widths in _notation_body.qmd, ml_workflow.qmd, and
   appendix_c3.qmd were drifting because the Iron Law / fleet-stack
   notation columns now contain \eta_{\text{hw}} / R_{\text{peak}} /
   L_{\text{lat}} forms that are wider than the pre-wrapping columns
   were sized for. The book-prettify-pipe-tables hook re-aligned the
   columns; accepting those auto-fixes.

2. Five vault exemplar YAMLs (cloud-2238, cloud-0730, cloud-sus-62002,
   cloud-fill-01177, tinyml-0046) had unquoted scenario: values
   containing a colon mid-sentence (e.g., 'disaggregated storage':),
   which made the YAML parser stop. Wrapped the scenario value in
   double quotes — none had embedded double-quotes so the wrap is safe.
   Pre-existing breakage (introduced before today's work) but blocked
   `check-yaml` on the full repo.
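
The quoting fix in item 2 is simple enough to sketch. The helper below is hypothetical and was not part of this commit; it handles single-line `scenario:` values only and refuses values with embedded double quotes or backslashes, since a plain wrap would change their meaning.

```python
def quote_scenario_line(line: str) -> str:
    """Wrap an unquoted single-line `scenario:` value in double quotes if it has a colon.

    Expects one line without its trailing newline; other lines pass through unchanged.
    """
    key = "scenario:"
    stripped = line.lstrip()
    if not stripped.startswith(key):
        return line
    indent = line[: len(line) - len(stripped)]
    value = stripped[len(key):].strip()
    already_quoted = value[:1] in ('"', "'")
    if already_quoted or ":" not in value:
        return line
    if '"' in value or "\\" in value:
        raise ValueError("value needs manual escaping before it can be wrapped")
    return f'{indent}{key} "{value}"'
```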
2026-05-03 17:22:05 -04:00
Vijay Janapa Reddi
7500b92819 docs(vault): AUTHORING.md — single-source authoring reference
The format conventions (Pitfall/Rationale/Consequence and
Assumptions/Calculations/Conclusion) were previously documented only
in:
  1. validate_drafts.py's gate_format_compliance regex (drafts only)
  2. generate_question_for_gap.py's SCHEMA_SUMMARY (LLM context)
  3. one paragraph in ARCHITECTURE.md §3.6.1

That's why 9.1% of published questions fail format compliance: there
is no human-readable reference. New authors learn the format by
osmosis or by reading rejected validations.

This doc is now the single source. Sections:
  - Quickstart (vault new flow)
  - Required-fields table with Pydantic constraints
  - Markup conventions (Pitfall/Rationale/Consequence; Assumptions/
    Calculations/Conclusion) — with rendering rules and accepted
    marker variants
  - Worked example: cloud-4539 (verified L3 reference)
  - Title conventions (≤120 chars, no period, no LaTeX, no underscores)
  - Levels ↔ Bloom mapping
  - Zones (4 pure + 6 compound + 1 mastery)
  - Zone × Bloom affinity matrix (HARD constraint enforced by validator)
  - 13 competency areas, 87 topics
  - Gotchas (I/O vs IO, straight vs curly apostrophes, etc.)
  - How to test (vault check --strict, validate_drafts.py)
  - End-to-end flow

Reference questions per (track, level) cell are populated from
CORPUS_HARDENING_PLAN.md Phase 4's audit findings.

CORPUS_HARDENING_PLAN.md Phase 2.
2026-05-03 08:11:47 -04:00
Vijay Janapa Reddi
e8f0faa839 chore(vault): explicit provenance: imported on 407 published questions
407 published questions had no top-level provenance line; Pydantic was
already filling the default at load time, but the field was invisible
on disk and in diffs. Now every published YAML carries provenance
explicitly.

Generated by interviews/vault-cli/scripts/backfill_provenance.py
(committed previously). Idempotent — re-running is a no-op.

Validation:
  vault check --strict      — 10,711 loaded, 0 invariant failures
  pytest                    — 74/74
  vault build --local-json  — release_hash UNCHANGED at 5a4783e62d…
                              (content-equivalent — runtime value was
                              already 'imported' via Pydantic default,
                              now explicit on disk)

CORPUS_HARDENING_PLAN.md Phase 1.
2026-05-03 08:06:41 -04:00
Vijay Janapa Reddi
3f0773706f chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation references
The Phase 0 cleanup removed 18 scripts as deprecated, but 6 of them have
unique-capability patterns not yet covered by the modern tooling. Restoring
them as reference patterns, not active scripts.

What's restored and why:

  gemini_backfill_question.py
    Idempotent corpus-walk + Gemini batch + thread-pool + JSON YAML
    round-trip. The "fix one field across thousands of YAMLs" pattern.
    To be mined in CORPUS_HARDENING_PLAN.md Phase 5.

  gpt_backfill_question.py
    OpenAI variant of the above. Cross-provider template.

  gemini_cli_generate_questions.py (35K)
    BATCHED generation: 12 cells per call with balanced track × area ×
    zone × level round-robin. `vault generate` does NOT batch — it calls
    once per question. This script's batching pattern is what we want
    when generating > 100 questions in bulk.

  generate.py (30K)
    Coverage-survey-driven generation engine: surveys the corpus, finds
    empty cells, generates to fill the emptiest first, stops when
    saturated. `vault generate` lacks this auto-balance loop.

  gemini_fix_errors.py
    Batch error-fixer with hardware-reference grounding (V100 / A100 /
    H100 / B200 / T4 specs as ground-truth context). To be mined for
    audit_corpus_batched.py --propose-fixes in Phase 5.

  deep_verify.py
    Claude Opus + extended thinking; SHOWS ITS WORK on every napkin-math
    claim. Useful as a tiebreaker on borderline math findings from the
    lightweight audit.

Each restored file has a 5-line STATUS comment block at the top
documenting what to adapt before running. DEPRECATED.md is restructured
to make the three categories explicit (removed / preserved-for-adaptation
/ active-migration), and adds an adaptation checklist that applies to
all preserved scripts (replace corpus.json loading, verify SDK pins,
update output paths, re-validate prompts, sample first).

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest — 74/74
  ruff — clean
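
The "fill the emptiest cells first" loop that generate.py is preserved for can be sketched in a few lines. This is a simplified illustration, not the restored script: cells here are (track, level) pairs rather than the full track × area × zone × level grid, and `emptiest_cells` is a hypothetical name.

```python
from collections import Counter
from itertools import product


def emptiest_cells(questions: list[dict], tracks: list[str], levels: list[int], limit: int) -> list[tuple]:
    """Survey the corpus and return the `limit` least-populated (track, level) cells."""
    counts = Counter((q["track"], q["level"]) for q in questions)
    cells = list(product(tracks, levels))
    # Stable sort: cells with equal counts keep their track/level order.
    cells.sort(key=lambda cell: counts[cell])
    return cells[:limit]
```

A generation loop would call this, generate for the returned cells, and stop once the smallest count reaches its saturation target.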
2026-05-03 07:50:28 -04:00
Vijay Janapa Reddi
56d3ed1551 chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0
All 18 scripts pre-date the YAML-as-source-of-truth migration
(ARCHITECTURE.md v2.x, Phase 1) and are listed in DEPRECATED.md's
replaced-by table. The corpus.json they ran against is itself now a
build artifact (gitignored, regenerated by `vault build --local-json`).

Removed top-level (13):
  build_corpus.py        → vault build (walks YAML, emits vault.db)
  export_to_staffml.py   → vault build --local-json
  extract_taxonomy.py    → vault/taxonomy.yaml
  deep_verify.py         → audit_chains_with_gemini.py + validate_drafts.py
  gemini_*.py × 6        → Phase-7 vault generate / batched audit pipeline
  gpt_backfill_question.py
  gate.py                → obsolete after schema v1.0
  generate.py            → vault generate

Removed archive/ (5):
  expand_tracks.py, fill_zone_gaps.py, fill_gaps.sh, final_balance.sh,
  README.md (now-orphan).

DEPRECATED.md updated: replaced-by table reorganized as a removal log
for git-archaeology, with a note that historical implementations are
findable via `git log --diff-filter=D`.

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest interviews/vault-cli/tests/ — 74/74
  ruff check interviews/vault-cli — clean

This is Phase 0 of CORPUS_HARDENING_PLAN.md.
2026-05-03 07:44:13 -04:00
Vijay Janapa Reddi
a74c98576e Merge origin/dev into yaml-audit
Sync the yaml-audit branch with the latest dev work since the previous
sync (5c5af75ed). Brings in 73 commits including:

  - CI security fixes: postcss XSS bump, uuid bounds bump, codeql
    paths-ignore for vendored bundles, read-only token on
    staffml-validate-vault workflow
  - kits/ dark mode polish: code-block readability, dropdown contrast
  - vault-cli/: pre-commit ruff hook + 20 ruff fixes, all-contributors
    auto-credit workflow change to pull_request_target
  - dev's earlier merge of yaml-audit (836d481b5) carrying the
    pre-trailer-strip Phase 1/2/3 history; this merge harmonises that
    with the current trailer-clean yaml-audit tip
  - misc bug fixes (tinytorch perceptron seed, infra workflows,
    socratiq vite dev injector)

Conflict resolutions keep the yaml-audit side as authoritative for
vault/* files (we own those) and the dev side as authoritative for
.github/workflows/* and other shared infrastructure.

# Conflicts:
#	.github/workflows/all-contributors-auto-credit.yml
#	.github/workflows/staffml-preview-dev.yml
#	interviews/staffml/src/data/corpus-summary.json
#	interviews/staffml/src/data/vault-manifest.json
#	interviews/staffml/tests/chain-and-vault-smoke.mjs
#	interviews/vault-cli/README.md
#	interviews/vault-cli/docs/CHAIN_ROADMAP.md
#	interviews/vault-cli/scripts/build_chains_with_gemini.py
#	interviews/vault-cli/scripts/generate_question_for_gap.py
#	interviews/vault-cli/scripts/merge_chain_passes.py
#	interviews/vault-cli/scripts/validate_drafts.py
#	interviews/vault-cli/src/vault_cli/legacy_export.py
#	interviews/vault-cli/tests/test_chain_validation.py
#	interviews/vault/.gitignore
#	interviews/vault/ARCHITECTURE.md
#	interviews/vault/chains.json
#	interviews/vault/id-registry.yaml
#	interviews/vault/questions/edge/optimization/edge-2536.yaml
#	interviews/vault/questions/mobile/deployment/mobile-2147.yaml
#	tinytorch/src/03_layers/03_layers.py
2026-05-02 11:06:43 -04:00
Vijay Janapa Reddi
924363e2b7 feat(vault): Phase 3 batch — 6 questions published + chain rebuild
Second Phase 3 batch run (post-pre-filter and post-tightened-validator).
30 gaps fed in; 21 dropped by the gap pre-filter as hallucinated; 9
generated drafts; 6 cleared all gates and were published; 3 dropped on
level_fit (level inflation pattern).

Published (status=published, human_reviewed=verified by vj, all gates
pass + audit_math pass):
  edge-2540    L4  edge/real-time-deadlines
  mobile-2151  L4  mobile/kv-cache-management
  mobile-2152  L2  mobile/kv-cache-management
  mobile-2154  L4  mobile/model-serving-infrastructure
  mobile-2157  L4  mobile/roofline-analysis
  mobile-2161  L5  mobile/power-budgeting

Rejected (level_fit failures — Gemini stamped L3-L5 on questions whose
cognitive demand is L1/L2; same failure mode the audit caught on the
first pilot):
  edge-2537    edge/real-time-deadlines       (level inflation)
  edge-2543    edge/transformer-systems-cost  (level inflation + mixed
                                               base-2/base-10 conversions)
  mobile-2156  mobile/quantization-fundamentals (level inflation)

Targeted chain rebuilds on the 5 affected buckets (5 parallel
build_chains_with_gemini.py --bucket calls):
  edge/real-time-deadlines                7 chains → 9
  mobile/kv-cache-management              4 chains → 6
  mobile/model-serving-infrastructure     5 chains → 4
  mobile/roofline-analysis                3 chains → 4
  mobile/power-budgeting                  2 chains → 6
                                       21 dropped → 29 added
  net chain count: 835 → 843 (+8)

5 of 6 published questions land in clean primary chains:
  edge-2540    in [edge-0114(L1) → … → edge-2540(L4) → … → edge-0621(L6+)]
  mobile-2152  in [mobile-2152(L2) → mobile-1097(L3) → mobile-1185(L4)]
  mobile-2154  in [mobile-0244(L1) → mobile-0305(L2) → mobile-2154(L4) → mobile-0654(L6+)]
  mobile-2157  in [mobile-0364(L2) → mobile-0537(L3) → mobile-2157(L4) → mobile-0617(L5)]
  mobile-2161  in [mobile-0151(L2) → mobile-0103(L3) → mobile-0581(L4) → mobile-2161(L5) → mobile-1587(L6+)]
mobile-2151 didn't enter a chain — Gemini chose other L4 candidates for
that bucket; mobile-2152 covers the bridge work.

Drive-by: 24 chain_ids renumbered to bucket-tagged form
(<track>-chain-bucket-<topic-slug>-<NN>) to resolve collisions.
build_chains_with_gemini.py's chain_id format uses call_idx, which
restarts at 1 for each --bucket invocation — collides with the
original full-corpus run's IDs and across parallel bucket runs.
Filed as a follow-up to fix the generator (use a content-stable or
bucket-tagged ID scheme).

Verification trail (80 Gemini calls total this batch):
  pre-filter:    30 calls, 21 hallucinated, 9 real (70% hallucination
                 — matches audit-2 measurement exactly)
  generation:    9 calls, 9/9 schema-valid
  audit_math:    9 calls, 9/9 pass (independent re-derivation of all
                 napkin_math arithmetic; 4-way parallel via the new
                 ThreadPoolExecutor in audit_math.py)
  validate_drafts: 27 calls (3 LLM judges × 9 drafts), 6/9 pass
  bucket rebuild: 5 calls, 5 strict-mode chain sets

Validation:
  apply_proposed_chains.py --dry-run: clean (843 chains)
  vault check --strict: 10,709 loaded, 0 invariant failures
  vault build --local-json: published_count=9446, chainCount=843,
    releaseHash=5a4783e62d2ca8d…
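
The bucket-tagged ID scheme from the drive-by can be sketched as below. The two-digit zero padding for `<NN>` and the function names are assumptions; the point is only that a per-invocation counter restarting at 1 no longer collides once the bucket is part of the ID.

```python
def bucket_chain_id(track: str, topic_slug: str, index: int) -> str:
    """Build a <track>-chain-bucket-<topic-slug>-<NN> identifier."""
    return f"{track}-chain-bucket-{topic_slug}-{index:02d}"


def assign_chain_ids(track: str, topic_slug: str, chain_count: int) -> list[str]:
    """Number one bucket's chains from 1, as a single --bucket run would."""
    return [bucket_chain_id(track, topic_slug, i) for i in range(1, chain_count + 1)]
```

Two parallel bucket runs both start counting at 1, but their ID sets stay disjoint because the topic slug differs.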
2026-05-02 10:54:17 -04:00
Vijay Janapa Reddi
825d9571a6 chore: remove archived content and refresh contributor docs
- Remove retired _archive/ and scripts/archive/ trees (site, book filters, games, vault); vault CHANGELOG points to git history for old scripts.
- CONTRIBUTING: site project row, site/ in area map, root vs TinyTorch pre-commit, vault schema drift wording.
- Newsletter CLI: path-agnostic news alias; tinytorch pre-commit comments; add tools/ and staffml-vault-types READMEs for maintainers.
2026-05-02 10:48:00 -04:00
Vijay Janapa Reddi
ac15ac2fd6 feat(vault): Phase 3.e — chain rebuild absorbing 2 new questions
After publishing mobile-2147 and edge-2536 in 9ab6bb85d (Phase 3.d
disposition), re-ran the strict-mode chain build on the two affected
buckets to absorb them into proper progressions.

Targeted rebuild (2 Gemini calls, ~1 min wall time vs ~25 min for
build_chains_with_gemini.py --all):

  build_chains_with_gemini.py --bucket mobile:model-format-conversion
  build_chains_with_gemini.py --bucket edge:pruning-sparsity

Results:
  mobile/model-format-conversion: 2 secondary chains → 12 primary chains.
    Notable: mobile-2147 lands in a clean L1→L2→L3→L4→L5→L6+ chain
    (mobile-0984 → mobile-2147 → mobile-1022 → mobile-1511 → mobile-0980
     → mobile-1662) — exactly the strict +1 progression the bridge was
    authored to enable.

  edge/pruning-sparsity:        3 secondary chains →  4 primary chains.
    Notable: edge-2536 lands in L1→L3→L4→L5 (edge-1784 → edge-1960 →
    edge-2536 → edge-1957) — slots between edge-1960 (L3) and edge-1957
    (L5) as designed, turning a Δ=2 jump into Δ=1 + Δ=1.

Both buckets transition from secondary-only to primary-only — strict
mode produced clean +1/+2 chains with the new bridges in place.

Net chain count: 824 → 835 (-5 old secondary, +16 new primary).

Validation:
  apply_proposed_chains.py --dry-run on merged chains.json: clean
  vault check --strict:                  10,703 loaded, 0 failures
  vault build --local-json:              chainCount=835, releaseHash 9b381a55…
2026-05-02 09:47:54 -04:00
Vijay Janapa Reddi
9ab6bb85d0 feat(vault): Phase 3 pilot disposition — 2 published, 3 rejected
Acting on the audit findings (independent Gemini audit, 2 runs converged
on the same per-draft verdicts). Of the 5 drafts in the Phase 3 pilot:

Published (status: published, human_reviewed: verified):
  mobile-2147  Model Format Conversion: Sizing the FP16 CoreML Payload
               Clean L2 / understand. FP32→FP16 storage halving on a
               15M-param iOS model. Realistic App Store framing,
               correct math, no fabrication.

  edge-2536    Diagnosing Zero Latency Gains from Unstructured Pruning
               on Coral TPU
               Canonical L4 / analyze lesson on dense systolic arrays
               + unstructured sparsity. Edited the scenario's baseline
               latency from 80ms → 15ms (more realistic for MobileNetV2
               on Coral USB TPU; audit flagged the 80ms figure as
               unrealistic). Pedagogical content unchanged.

Rejected (deleted):
  edge-2537    edge/tco-cost-modeling
               Audit (both runs) flagged "cognitive load too low for L3
               — basic arithmetic word problem with all parameters
               given". Real L3 TCO questions require judgement under
               uncertainty; this one is L1/L2.

  mobile-2146  mobile/duty-cycling
               Audit flagged a physically absurd 0.5s wake-up at 4W for
               a mobile NPU (real NPUs wake in milliseconds). Run 2
               additionally flagged the dashcam framing as broken (a
               dashcam idle 75% of the time would miss accidents).
               Premise is fiction; the lesson can't be salvaged.

  edge-2535    edge/latency-decomposition
               Failed validate_drafts.py originality gate at promotion
               (cosine 0.933 vs its own bridge anchor edge-1883). It was
               left as .yaml.draft pending review; the content is fine on
               its own, but it pedagogically duplicates the lesson in the
               now-promoted edge-2536 (host-side bottleneck on Coral).
               Cleaner to drop than de-duplicate.

The 4 ID entries in id-registry.yaml stay (append-only ledger); the
removed YAMLs become dangling registry entries which is the intended
behaviour — the registry is "every ID ever assigned", not "every ID
currently active".

Validation:
  vault check --strict:    10,703 loaded, 0 invariant failures
  vault build --local-json: 9440 published (was 9438 + 2), chainCount=824,
                           releaseHash a9a601c2bf… (was 479811040b…)
2026-05-02 09:39:52 -04:00
Vijay Janapa Reddi
ed391afa74 fix(vault-cli): regenerate exemplar-gaps.yaml after authoring-tool question shifts
The exemplar-coverage audit went stale after 84b1fab082 ("Phase 3.a + 3.b —
gap-driven authoring tooling") added drafts and authoring scripts that shifted
per-cell question counts. CI's StaffML Validate (Vault) job was failing the
staleness check (`exemplar-gaps.yaml is stale; re-run audit and commit`).

Regenerated by running:

    vault build --vault-dir interviews/vault --release-id regression-ci
    python3 interviews/vault-cli/scripts/exemplar_coverage_audit.py

Total cells dropped 230 → 225 (cells_with_gap matches). Sample shifts match
the CI failure log: tinyml/l5/specification 38→35, tinyml/l6+/design 35→37,
tinyml/l6+/mastery 72→67, tinyml/l6+/optimization 6→4.
2026-05-02 09:15:45 -04:00
Vijay Janapa Reddi
2b3cf5e1da chore(vault): consolidate AI pipeline artifacts under _pipeline/
Establishes one ignored subdirectory for ALL intermediate outputs of
LLM-driven tooling (chain proposals, gap detection, draft scorecards,
audit traces). Single gitignore rule: /_pipeline/.

Convention is documented in interviews/vault/README.md under "Pipeline
artifacts" — it's a real project layout convention, not AI-specific
config.

Path migration:
  interviews/vault/chains.proposed*.json
                  → _pipeline/chains.proposed*.json
  interviews/vault/gaps.proposed*.json
                  → _pipeline/gaps.proposed*.json
  interviews/vault/draft-validation-scorecard.json
                  → _pipeline/draft-validation-scorecard.json
  interviews/vault/audit-runs/
                  → _pipeline/runs/

8 scripts updated to define a PIPELINE_DIR constant and route default
outputs through it: build_chains_with_gemini.py,
apply_proposed_chains.py, merge_chain_passes.py, validate_drafts.py,
audit_chains_with_gemini.py, generate_question_for_gap.py,
summarize_proposed_chains.py, promote_drafts.py.

Forward-looking docs (README.md chain-pipeline section + CHAIN_ROADMAP.md
resume instructions + state snapshot) updated to reference the new
paths. Historical Progress Log entries left as-is — they accurately
describe what was committed at the time.

Drive-by .gitignore fixes (both used full repo-relative paths under
package-local .gitignore files, which never matched):
  interviews/vault-cli/.gitignore: scripts/.calibration_cache/
  interviews/vault/.gitignore:     /embeddings.npz

Validation:
  - vault check --strict: 10,705 loaded, 0 invariant failures
  - pytest interviews/vault-cli/tests/: 74/74
  - audit --dry-run: paths resolve correctly to _pipeline/runs/<ts>/

No durable corpus content moves. chains.json (live registry),
id-registry.yaml, questions/, etc. all stay where they were.
2026-05-02 09:04:55 -04:00
Vijay Janapa Reddi
270b1a5bd2 fix(vault): drop 55 Δ=0 chains + remove Δ=0 from lenient mode
Action on the strongest finding from the 2026-05-01 independent audit:
54 of 55 Δ=0 chains had no shared scenario (the "two questions
sharing a scenario thread" constraint the lenient prompt was supposed
to enforce). Two independent audit fields agreed (verdict=bad and
shared_scenario=no), so this isn't a tuning question — the design
choice was wrong.

Why remove Δ=0 entirely rather than tighten the prompt:

  - The chain definition is "pedagogical progression through Bloom
    levels"; same-level edges contradict the definition.
  - The "shared scenario / different angle" carve-out is unenforceable
    by an LLM at corpus scale (audit confirmed).
  - Same-scenario same-level pairs are more honestly modeled as
    siblings of a chain anchor, not as chain members.

Changes:
  - chains.json: 879 → 824. Dropped: 55 chains (all tier=secondary,
    since Δ=0 was only ever produced by the lenient sweep).
    Per-track: edge -19, tinyml -12, mobile -10, cloud -7, global -7.
  - build_chains_with_gemini.py:
      MODE_CONFIG["lenient"]["allowed_deltas"]: {0,1,2,3} → {1,2,3}
      LENIENT_PROMPT_TEMPLATE: Δ=0 paragraph rewritten to explicitly
        REJECT same-level pairs (with rationale citing the audit).
      docstring + --mode help text updated.
  - tests/test_chain_validation.py:
      test_lenient_accepts_same_level_pair → test_lenient_rejects_same_level_pair
      header docstring updated to reflect the new rule.
  - vault-manifest.json: chainCount 879 → 824, releaseHash rolls to
    479811040b7a… (real content delta, not timestamp churn).
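
A small sketch of the mode-aware delta rule after this change. Only the
"allowed_deltas" key and the lenient set {1,2,3} come from this commit;
the strict set {1,2} follows the strict-progression rule stated
elsewhere in this log. The function names, and representing L6+ as the
integer 6, are assumptions for illustration.

```python
from __future__ import annotations

# Assumed shape; the real MODE_CONFIG may carry additional keys.
MODE_CONFIG = {
    "strict": {"allowed_deltas": {1, 2}},
    "lenient": {"allowed_deltas": {1, 2, 3}},
}


def level_deltas(levels: list[int]) -> list[int]:
    """Level difference between each pair of consecutive chain members."""
    return [b - a for a, b in zip(levels, levels[1:])]


def chain_is_valid(levels: list[int], mode: str) -> bool:
    """A chain is valid only if every consecutive delta is allowed for the mode."""
    allowed = MODE_CONFIG[mode]["allowed_deltas"]
    return all(delta in allowed for delta in level_deltas(levels))
```

Under this rule a same-level pair such as L2→L2 yields Δ=0 and is
rejected in both modes, which is the behaviour
test_lenient_rejects_same_level_pair is named for.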

Validation:
  - vault check --strict: 10,705 loaded, 0 failures
  - vault build --local-json: chainCount=824, releaseHash=479811040b…
  - pytest: 74/74
  - playwright chain-and-vault-smoke: 19/19 (fixtures cloud-0001 +
    cloud-0231 are still in their chains post-drop)

Audit findings #2 (gap detection ~50% noise) and #3 (4 pilot drafts
disposition) remain open — see CHAIN_ROADMAP.md Progress Log.
2026-05-02 08:51:49 -04:00
Vijay Janapa Reddi
b68f6dbf83 audit(vault): independent Gemini audit — 18 calls, 3 critical findings
Ran audit_chains_with_gemini.py end-to-end. 18 Gemini-3.1-pro-preview
calls (well under the 250/day cap) sized to 80-336K char prompts (the
attention sweet spot at ~80-100K input tokens). Per-call traces under
interviews/vault/audit-runs/20260501T213817Z/, rollup at
interviews/vault/audit-runs/AUDIT_REPORT.md.

Three critical findings the pipeline's own gates missed:

  1. Δ=0 chains are ~98% bad (54/55 judged "bad", 54/55 judged
     "shared_scenario_for_d0_pair: no"). The lenient prompt's
     constraint that Δ=0 only fire for shared-scenario pairs didn't
     bind in practice. 6% of chains.json is affected.

  2. Gap detection is ~50% noise. 21 of 40 sampled gaps judged
     "hallucinated" — anchors don't share a scenario thread. Phase 3
     generation should pre-filter gaps before issuing the call.

  3. Pilot draft pass rate was inflated by validate_drafts.py's LLM
     judges:
       mobile-2147  accept
       edge-2536    edit (scenario truncation)
       edge-2537    REJECT (cognitive load too low for L3)
       mobile-2146  REJECT (physically absurd 0.5s/4W NPU wake-up)

Calibration findings:
  - Primary chains (n=100): 64% good, 22% weak, 14% bad
  - Secondary chains (n=100): 61% good, 33% weak, 6% bad
  - Tier delta vs primary is small at "good" — the actual quality
    cliff in secondary is concentrated in the Δ=0 subset.

No autonomous fixes filed — per agreement, audit produces findings
only. CHAIN_ROADMAP.md Progress Log spells out the three concrete
decisions for next session (drop / demote / rebuild Δ=0; pre-filter
gaps; disposition the 4 drafts per AUDIT_REPORT.md).

Total Gemini calls this session: 55 (Phase 1.4 + Phase 3 pilot + audit).
2026-05-01 18:04:36 -04:00
Vijay Janapa Reddi
bc553017b4 docs(vault): roadmap status + Phase 3 authoring conventions
D-cleanups folded into one commit:

  - CHAIN_ROADMAP.md status header reflects current state (Phase 1+2
    complete, Phase 3 pilot landed, Phase 4 mostly shipped).
  - Phase 4.1 / 4.6 / 4.7 / 4.9 entries marked complete with commit
    refs.
  - ARCHITECTURE.md gains a §3.6.1 documenting the two YAML-body
    conventions introduced when LLM-authored questions started
    landing in Phase 3:
      - _authoring private metadata block on drafts (stripped at
        promotion)
      - gap-bridge:<from>-<to> tag added at promotion for traceability
    Neither is schema-enforced (Pydantic accepts extra); both are
    stable across the pipeline.

No code changes.
2026-05-01 17:33:36 -04:00
Vijay Janapa Reddi
202397f594 Merge origin/dev into yaml-audit
Pull in the dev work that landed since yaml-audit was last synced:
  - --legacy-json renamed to --local-json (2b381bb949) — script/doc
    updates needed below in this branch
  - CI workflow refactor (validate-dev / validate-vault now reusable)
  - all-contributors automation, gitignore tightening, codespell list
  - PR #1622 navbar URL rewrite for dev preview
  - PR #1619 clone-size refactor, #1618 milestone3 xor fix, #1617
    perceptron seed, #1616 tito status M3
  - Chapter 9 PDF layout refinement
  - assorted staffml/practice fixes (pickRandom deps, GitHub star gate)

This merges the canonical dev state into yaml-audit so subsequent
work continues on top of the freshest base. Conflicts in
practice/page.tsx + corpus.ts + ARCHITECTURE.md resolved to keep both
sides' additive changes (Phase 2 tier work + dev's later refactors).
2026-05-01 17:11:31 -04:00
Vijay Janapa Reddi
836d481b54 Merge branch 'yaml-audit' into dev (Phase 1 + 2 + 3 pilot + 4.8 docs)
Brings the chain corpus growth + tier-aware UI work into dev:

  - Phase 1: chains 373 → 879 (second-pass coverage build, primary +
    secondary tier; bucket coverage 33% → 91%)
  - Phase 2: tier surfacing through schema → TypeScript → UI (primary
    chains default; secondary reachable via ?chain= URL with "alt path"
    badge); 17/17 playwright
  - Phase 3 pilot: 5 gap-driven generations, 4 promoted as drafts
    (status=draft pending human review). edge-2535 left as .yaml.draft
    (failed originality gate).
  - Phase 4.8: ARCHITECTURE.md §3.6 + README "Chain build pipeline"
    section documenting v1.1 sidecar + hierarchy + tier model.

State at merge:
  - vault check --strict: 10,705 loaded (4 new drafts), 0 invariant failures
  - vault build --legacy-json: 9438 published, chainCount=879
    (drafts excluded by status filter — releaseHash unchanged from Phase 1)
  - playwright chain-and-vault-smoke: 17/17 (last yaml-audit run)

Phase 3.e (chain rebuild absorbing the new questions) gated on the
human review of the 4 drafts. Runbook in CHAIN_ROADMAP.md.
2026-05-01 13:39:33 -04:00
Vijay Janapa Reddi
bf70e7686f feat(vault): Phase 3 pilot — 5 gaps generated, 4 promoted as drafts
Pilot run of the Phase 3 authoring tooling on a 5-gap subset (sized
down from the roadmap's 30 to keep wall-time + Gemini-call budget
reasonable for an unsupervised run).

Pilot scope:
  Selected 5 high-value gaps from gaps.proposed.lenient.json — buckets
  with ≥4 published questions, biased toward low-density tracks. All 5
  picks landed in edge/mobile.

Phase 3.c — generate (5/5 written):
  edge-2535  edge/latency-decomposition L?→L3
  edge-2536  edge/pruning-sparsity L?→L4
  edge-2537  edge/tco-cost-modeling L?→L3
  mobile-2146  mobile/duty-cycling L?→L3
  mobile-2147  mobile/model-format-conversion L?→L2

Phase 3.b validation — 4/5 pass (80% — above roadmap's 60-75% target):
  edge-2535: FAIL on originality (cos=0.933 vs edge-1883, threshold 0.92)
  edge-2536: pass on all 5 gates
  edge-2537: pass on all 5 gates
  mobile-2146: pass on all 5 gates
  mobile-2147: pass on all 5 gates

The originality gate correctly caught a draft that was too similar
to one of its bridge anchors — exactly the failure mode it was
designed for. Gates were run on schema (Pydantic), originality
(BAAI/bge-small-en-v1.5 cosine vs in-bucket neighbours, threshold
0.92), level_fit (Gemini-judge against same-level exemplars),
coherence (Gemini-judge), and bridge (Gemini-judge against the gap
anchors).
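
A minimal sketch of the originality check, assuming embeddings are
already computed as plain float vectors. The 0.92 threshold is from
this commit; the function names and the strict "<" comparison are
assumptions, and the real gate in validate_drafts.py may differ.

```python
from __future__ import annotations

import math

ORIGINALITY_THRESHOLD = 0.92  # threshold quoted in this commit


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def passes_originality(
    draft_vec: list[float],
    neighbour_vecs: list[list[float]],
    threshold: float = ORIGINALITY_THRESHOLD,
) -> bool:
    """Pass only if the closest in-bucket neighbour stays below the threshold."""
    closest = max((cosine(draft_vec, n) for n in neighbour_vecs), default=0.0)
    return closest < threshold
```

With this rule, edge-2535's similarity of 0.933 to edge-1883 exceeds
0.92, so the draft fails regardless of how it scores against other
neighbours.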

Phase 3.d — promotion (4 passing drafts):
  - .yaml.draft → .yaml rename
  - _authoring stripped; replaced with proper schema fields:
      provenance: llm-draft
      status: draft  (NOT published — gating on human review)
      authors: [gemini-3.1-pro-preview]
      human_reviewed: { status: not-reviewed }
      tags: + gap-bridge:<from>-<to>
  - id-registry.yaml appended (append-only ledger preserved)
  - edge-2535.yaml.draft kept in place for the human reviewer's
    disposition (rewrite + retry vs delete)

Validation post-promotion:
  - vault check --strict: 10,705 loaded (was 10,701; +4 ✓), 0 failures
  - vault build --legacy-json: released set unchanged
    (status=draft excluded by release-policy.yaml's published filter)
    — releaseHash and chainCount intentionally stable until human
    review flips status

Phase 3.e (chain rebuild) deferred: drafts must clear human review
and flip to status: published before they're eligible for chain
membership. Runbook in CHAIN_ROADMAP.md Progress Log.

Cost: 5 generation + 15 judge = 20 Gemini calls.
2026-05-01 13:38:18 -04:00
Vijay Janapa Reddi
9f83d3e8a6 feat(vault): Phase 3 pilot — 5 gaps generated, 4 promoted as drafts
Pilot run of the Phase 3 authoring tooling on a 5-gap subset (sized
down from the roadmap's 30 to keep wall-time + Gemini-call budget
reasonable for an unsupervised run).

Pilot scope:
  Selected 5 high-value gaps from gaps.proposed.lenient.json — buckets
  with ≥4 published questions, biased toward low-density tracks. All 5
  picks landed in edge/mobile.

Phase 3.c — generate (5/5 written):
  edge-2535  edge/latency-decomposition L?→L3
  edge-2536  edge/pruning-sparsity L?→L4
  edge-2537  edge/tco-cost-modeling L?→L3
  mobile-2146  mobile/duty-cycling L?→L3
  mobile-2147  mobile/model-format-conversion L?→L2

Phase 3.b validation — 4/5 pass (80% — above roadmap's 60-75% target):
  edge-2535: FAIL on originality (cos=0.933 vs edge-1883, threshold 0.92)
  edge-2536: pass on all 5 gates
  edge-2537: pass on all 5 gates
  mobile-2146: pass on all 5 gates
  mobile-2147: pass on all 5 gates

The originality gate correctly caught a draft that was too similar
to one of its bridge anchors — exactly the failure mode it was
designed for. Gates were run on schema (Pydantic), originality
(BAAI/bge-small-en-v1.5 cosine vs in-bucket neighbours, threshold
0.92), level_fit (Gemini-judge against same-level exemplars),
coherence (Gemini-judge), and bridge (Gemini-judge against the gap
anchors).

Phase 3.d — promotion (4 passing drafts):
  - .yaml.draft → .yaml rename
  - _authoring stripped; replaced with proper schema fields:
      provenance: llm-draft
      status: draft  (NOT published — gating on human review)
      authors: [gemini-3.1-pro-preview]
      human_reviewed: { status: not-reviewed }
      tags: + gap-bridge:<from>-<to>
  - id-registry.yaml appended (append-only ledger preserved)
  - edge-2535.yaml.draft kept in place for the human reviewer's
    disposition (rewrite + retry vs delete)

Validation post-promotion:
  - vault check --strict: 10,705 loaded (was 10,701; +4 ✓), 0 failures
  - vault build --legacy-json: released set unchanged
    (status=draft excluded by release-policy.yaml's published filter)
    — releaseHash and chainCount intentionally stable until human
    review flips status

Phase 3.e (chain rebuild) deferred: drafts must clear human review
and flip to status: published before they're eligible for chain
membership. Runbook in CHAIN_ROADMAP.md Progress Log.

Cost: 5 generation + 15 judge = 20 Gemini calls.
2026-05-01 13:38:18 -04:00
Vijay Janapa Reddi
cbb28ebf26 docs(vault): document v1.1 sidecar + hierarchy + tier model
Phase 4.8 of CHAIN_ROADMAP.md.

ARCHITECTURE.md gains a new §3.6 capturing the three deltas that landed
during the chain workstream — additive to v1, not replacements:
  - hierarchical question layout (`<track>/<area>/<id>.yaml`)
  - sidecar chain architecture (chains.json authoritative; YAML chains:
    field retired)
  - chain tier model (primary/secondary, default-primary on read)

README.md updates:
  - status line: v1.1, points at CHAIN_ROADMAP.md and ARCHITECTURE.md §3.6
  - new "Chain build pipeline" section with the diagnose / build /
    apply / merge invocations
  - layout listing reflects scripts/ and the actual src/ contents
    (was stuck on Phase 0 scaffolding shape)

No code changes. The v1 release-pipeline invariants absorb the v1.1
deltas without modification (chains.json is a Merkle leaf; tier flows
into that leaf transparently).
2026-04-30 20:26:09 -04:00
Vijay Janapa Reddi
519581c1c3 docs(vault): document v1.1 sidecar + hierarchy + tier model
Phase 4.8 of CHAIN_ROADMAP.md.

ARCHITECTURE.md gains a new §3.6 capturing the three deltas that landed
during the chain workstream — additive to v1, not replacements:
  - hierarchical question layout (`<track>/<area>/<id>.yaml`)
  - sidecar chain architecture (chains.json authoritative; YAML chains:
    field retired)
  - chain tier model (primary/secondary, default-primary on read)

README.md updates:
  - status line: v1.1, points at CHAIN_ROADMAP.md and ARCHITECTURE.md §3.6
  - new "Chain build pipeline" section with the diagnose / build /
    apply / merge invocations
  - layout listing reflects scripts/ and the actual src/ contents
    (was stuck on Phase 0 scaffolding shape)

No code changes. The v1 release-pipeline invariants absorb the v1.1
deltas without modification (chains.json is a Merkle leaf; tier flows
into that leaf transparently).
2026-04-30 20:26:09 -04:00
Vijay Janapa Reddi
83fe0f7193 feat(vault): Phase 1 — second-pass chain coverage build (373 → 879)
Diagnoses uncovered (track, topic) buckets and runs a relaxed Gemini
sweep targeting them. New chains tier="secondary"; pre-existing chains
backfilled tier="primary".

Tools (Phases 1.1, 1.2/1.3, 1.5):
  - diagnose_chain_coverage.py: surface buckets with no chains
    (committed earlier on yaml-audit)
  - build_chains_with_gemini.py: --mode lenient adds Δ ∈ {0,1,2,3}
    (committed earlier on yaml-audit)
  - merge_chain_passes.py: merges primary + secondary, enforces the
    multi-membership cap (max 2 chains/qid; non-L1/L2 capped at 1)
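
A sketch of the multi-membership cap as stated above: L1/L2 questions
may sit in up to 2 chains, everything else in at most 1. The data
shapes (chains as lists of question IDs, a level lookup dict) and the
function names are assumptions, not merge_chain_passes.py's actual
interface.

```python
from __future__ import annotations

from collections import Counter


def membership_cap(level: int) -> int:
    """L1 and L2 questions may appear in two chains; all others in one."""
    return 2 if level in (1, 2) else 1


def cap_violations(
    chains: list[list[str]], level_of: dict[str, int]
) -> dict[str, int]:
    """Map each over-cap question ID to the number of chains it appears in."""
    counts = Counter(qid for chain in chains for qid in set(chain))
    return {
        qid: n for qid, n in counts.items() if n > membership_cap(level_of[qid])
    }


# Hypothetical IDs: q-a is L1 (may repeat once), q-b is L3 (may not).
example_chains = [["q-a", "q-b"], ["q-a", "q-c"], ["q-b", "q-d"]]
example_levels = {"q-a": 1, "q-b": 3, "q-c": 2, "q-d": 4}
violations = cap_violations(example_chains, example_levels)
```

In the example only q-b is flagged: it is an L3 question that appears
in two chains, while the L1 question q-a is within its cap of 2.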

Sweep (Phase 1.4):
  - 17 Gemini-3.1-pro-preview calls, ~22 min wall time, 211 buckets
  - 506 chains accepted (above the 200-400 estimate), 269 new gaps
  - validator caught a few cross-bucket and Δ=4 hallucinations inline
  - Δ distribution: Δ=1 69.1%, Δ=2 21.1%, Δ=3 4.6%, Δ=0 5.2%
    (10.9% of chains contain at least one Δ=0 — within target band)
  - random spot-check of 5 Δ=0 chains: all share scenario threads
    (DMA, CMSIS-NN, on-device routing, PB-scale pipelines)

Coverage gains (chains/topic before → after):
  - cloud   2.95 → 4.37   (242 + 116 secondary)
  - edge    0.64 → 2.59   ( 49 + 148 secondary)
  - mobile  0.74 → 2.56   ( 46 + 113 secondary)
  - tinyml  0.80 → 2.64   ( 36 +  83 secondary)
  - global  0.00 → 0.96   (  0 +  46 secondary)
  Buckets with ≥1 chain: 102 / 313 (33%) → 285 / 313 (91%).

Validation:
  - apply_proposed_chains.py --dry-run: validation clean (879 chains)
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: chainCount 373 → 879, release_hash
    rolled to 04ee8a23…
  - playwright chain-and-vault-smoke.mjs: 13/13 pass

Phase 1 complete. Next: Phase 2 (tier surfacing in staffml UI).
2026-04-30 20:12:27 -04:00
Vijay Janapa Reddi
9e6f87bbd4 feat(vault): Phase 1 — second-pass chain coverage build (373 → 879)
Diagnoses uncovered (track, topic) buckets and runs a relaxed Gemini
sweep targeting them. New chains tier="secondary"; pre-existing chains
backfilled tier="primary".

Tools (Phases 1.1, 1.2/1.3, 1.5):
  - diagnose_chain_coverage.py: surface buckets with no chains
    (committed earlier on yaml-audit)
  - build_chains_with_gemini.py: --mode lenient adds Δ ∈ {0,1,2,3}
    (committed earlier on yaml-audit)
  - merge_chain_passes.py: merges primary + secondary, enforces the
    multi-membership cap (max 2 chains/qid; non-L1/L2 capped at 1)

Sweep (Phase 1.4):
  - 17 Gemini-3.1-pro-preview calls, ~22 min wall time, 211 buckets
  - 506 chains accepted (above the 200-400 estimate), 269 new gaps
  - validator caught a few cross-bucket and Δ=4 hallucinations inline
  - Δ distribution: Δ=1 69.1%, Δ=2 21.1%, Δ=3 4.6%, Δ=0 5.2%
    (10.9% of chains contain at least one Δ=0 — within target band)
  - random spot-check of 5 Δ=0 chains: all share scenario threads
    (DMA, CMSIS-NN, on-device routing, PB-scale pipelines)

Coverage gains (chains/topic before → after):
  - cloud   2.95 → 4.37   (242 + 116 secondary)
  - edge    0.64 → 2.59   ( 49 + 148 secondary)
  - mobile  0.74 → 2.56   ( 46 + 113 secondary)
  - tinyml  0.80 → 2.64   ( 36 +  83 secondary)
  - global  0.00 → 0.96   (  0 +  46 secondary)
  Buckets with ≥1 chain: 102 / 313 (33%) → 285 / 313 (91%).

Validation:
  - apply_proposed_chains.py --dry-run: validation clean (879 chains)
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: chainCount 373 → 879, release_hash
    rolled to 04ee8a23…
  - playwright chain-and-vault-smoke.mjs: 13/13 pass

Phase 1 complete. Next: Phase 2 (tier surfacing in staffml UI).
2026-04-30 20:12:27 -04:00
Vijay Janapa Reddi
b289a5eb75 Merge branch 'yaml-audit' into dev
Brings the vault chain rebuild + sidecar architecture work into dev:

  - Hierarchical question layout (interviews/vault/questions/<track>/<area>/<id>.yaml)
    completed in earlier dev merge; this branch adds the sidecar split
  - chains.json is now the authoritative chain registry; YAML chains: field
    stripped from all 10,701 question files
  - 373 chains rebuilt via Gemini 3.1 Pro Preview with strict progression
    rules (Δ ∈ {1,2}, single-track, single-topic, multi-membership cap=2)
  - 138 gaps surfaced into gaps.proposed.json for Phase 3 authoring
  - Tooling: build_chains_with_gemini.py, apply_proposed_chains.py,
    summarize_proposed_chains.py, diagnose_chain_coverage.py
  - CHAIN_ROADMAP.md captures the resumable Phase 1-4 plan

State at merge:
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: clean, releaseId=dev, 9438 published, 373 chains
  - playwright UI suite (last run on yaml-audit): 13/13 pass

Phase 1.1 (diagnose_chain_coverage.py) shipped on yaml-audit; Phase
1.2-1.6 (lenient sweep, tier merge) still pending. See CHAIN_ROADMAP.md
Progress Log for the resumable cursor.
2026-04-30 18:39:05 -04:00
Vijay Janapa Reddi
f527c230f3 Merge branch 'yaml-audit' into dev
Brings the vault chain rebuild + sidecar architecture work into dev:

  - Hierarchical question layout (interviews/vault/questions/<track>/<area>/<id>.yaml)
    completed in earlier dev merge; this branch adds the sidecar split
  - chains.json is now the authoritative chain registry; YAML chains: field
    stripped from all 10,701 question files
  - 373 chains rebuilt via Gemini 3.1 Pro Preview with strict progression
    rules (Δ ∈ {1,2}, single-track, single-topic, multi-membership cap=2)
  - 138 gaps surfaced into gaps.proposed.json for Phase 3 authoring
  - Tooling: build_chains_with_gemini.py, apply_proposed_chains.py,
    summarize_proposed_chains.py, diagnose_chain_coverage.py
  - CHAIN_ROADMAP.md captures the resumable Phase 1-4 plan

State at merge:
  - vault check --strict: 10,701 loaded, 0 invariant failures
  - vault build --legacy-json: clean, releaseId=dev, 9438 published, 373 chains
  - playwright UI suite (last run on yaml-audit): 13/13 pass

Phase 1.1 (diagnose_chain_coverage.py) shipped on yaml-audit; Phase
1.2-1.6 (lenient sweep, tier merge) still pending. See CHAIN_ROADMAP.md
Progress Log for the resumable cursor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:39:05 -04:00
Vijay Janapa Reddi
af5f25f543 feat(vault-cli): diagnose_chain_coverage.py — surface buckets needing chains
Loads the published corpus (via vault_cli.policy — single source of truth)
and chains.json, buckets by (track, topic), and emits chain-coverage.json
with two cuts:
  - uncovered_buckets: ≥3 questions, 0 chains
  - under_covered_buckets: ≥6 questions, ≤1 chain
Plus per-track summary + top-10 uncovered for quick read.
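
The two cuts can be sketched as a single classifier. The thresholds are
the ones listed above; the function name, the "not_flagged" label, and
the choice to report a bucket that meets both cuts as uncovered are
assumptions of this sketch.

```python
def classify_bucket(n_questions: int, n_chains: int) -> str:
    """Classify one (track, topic) bucket by question and chain counts."""
    # Checked first: a bucket with >=6 questions and 0 chains meets both cuts.
    if n_questions >= 3 and n_chains == 0:
        return "uncovered"
    if n_questions >= 6 and n_chains <= 1:
        return "under_covered"
    return "not_flagged"
```

For example, cloud:roofline-analysis at 144 questions and 0 chains
falls in the uncovered cut.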

Output is gitignored — regeneratable, fed to Phase 1.4's --buckets-from.

Phase 1.1 of CHAIN_ROADMAP.md. See progress log for the run results
(211 uncovered buckets, edge/mobile/tinyml chain density 0.6-0.8 vs
cloud's 2.95, biggest miss is cloud:roofline-analysis at 144q/0 chains).
2026-04-30 18:15:59 -04:00
Vijay Janapa Reddi
3526176384 feat(vault-cli): diagnose_chain_coverage.py — surface buckets needing chains
Loads the published corpus (via vault_cli.policy — single source of truth)
and chains.json, buckets by (track, topic), and emits chain-coverage.json
with two cuts:
  - uncovered_buckets: ≥3 questions, 0 chains
  - under_covered_buckets: ≥6 questions, ≤1 chain
Plus per-track summary + top-10 uncovered for quick read.

Output is gitignored — regeneratable, fed to Phase 1.4's --buckets-from.

Phase 1.1 of CHAIN_ROADMAP.md. See progress log for the run results
(211 uncovered buckets, edge/mobile/tinyml chain density 0.6-0.8 vs
cloud's 2.95, biggest miss is cloud:roofline-analysis at 144q/0 chains).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:15:59 -04:00
Vijay Janapa Reddi
1ac7d4c564 feat(vault): rebuild chains.json via Gemini 3.1 Pro Preview — 373 curated chains
Replaced the 726 author-curated chains with 373 LLM-curated chains
generated bucket-by-bucket within (track, topic). Gemini was prompted
with the strict-progression + multi-chain constraints we agreed on:

  - Δ ∈ {1, 2} between consecutive members (prefer +1)
  - Up to 2-chain membership only for L1/L2 anchors
  - Single-topic, 2-6 members, no Δ=0 same-level pairs
  - Validated structurally on apply — vault check --strict passes

Sweep stats:
  - 44 calls to gemini-3.1-pro-preview (well under 250/day cap)
  - 313 (track, topic) buckets processed in ~80 minutes
  - 373 chains accepted (51% of legacy count, much higher per-chain
    quality after strict filter)
  - Level-Δ distribution: 949 strict +1 (93%), 73 +2 (7%) — 0 +0/+3+
  - Chain sizes: 26 size-2, 141 size-3, 128 size-4, 60 size-5, 18 size-6
  - 1,395 questions in chains (15% of corpus, vs ~20% before)
  - 54 of ~87 topics have at least 1 chain
  - 138 corpus gaps identified (gaps.proposed.json) — missing-rung
    questions that would complete chains; feeds future authoring pass
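
As a cross-check, the sweep stats above are internally consistent. The
short script below recomputes the totals from the chain-size histogram;
all inputs are the figures quoted in this message, and the variable
names are illustrative only.

```python
# Chain-size histogram from the sweep stats: {members per chain: chain count}.
chain_sizes = {2: 26, 3: 141, 4: 128, 5: 60, 6: 18}

total_chains = sum(chain_sizes.values())
# A chain with k members has k - 1 consecutive edges.
total_edges = sum((size - 1) * n for size, n in chain_sizes.items())
# Membership slots: one per (chain, member) pair.
total_slots = sum(size * n for size, n in chain_sizes.items())

plus_one_edges, plus_two_edges = 949, 73
plus_one_share = plus_one_edges / total_edges
legacy_share = total_chains / 726  # 726 legacy chains

print(total_chains, total_edges, total_slots)
print(round(plus_one_share * 100), round(legacy_share * 100))
```

The histogram sums to 373 chains and 1,022 edges (949 + 73), and the
1,395 membership slots match the reported count of questions in chains.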

Why fewer chains than before is fine:
  - Old chains had a long tail with cos<0.65 (worse than random
    same-bucket pairs). LLM curation rejects those.
  - We trade quantity for pedagogical coherence.
  - The 138 gaps capture what was implicit in old chains via
    questions-that-shouldn't-have-been-paired; we make it explicit.

Files:
  - chains.json — applied (was backed up to chains.json.bak by
    apply_proposed_chains.py)
  - chains.proposed.json — kept for review/audit
  - gaps.proposed.json — authoring backlog
  - vault-manifest.json + corpus-summary.json — regenerated
  - corpus.json — gitignored (CI regenerates)

Validation: vault check --strict 0 failures, vault build clean,
playwright UI suite 13/13 pass.
2026-04-30 15:15:45 -04:00