Vijay Janapa Reddi eb71638630 feat(vault): release-grade Phase G — full audit + cleanup + 0.1.3 release
Final brute-force release-readiness pass: every gate green, 0.1.3
released and verified, every observable failure mode closed at source.

═══ AUDITS (G.A–G.D) ═══

G.A — gemini-3.1-pro-preview default everywhere. Active CLI scripts
    already used it; bulk-patched 6 legacy scripts (`generate_batch.py`,
    `validate_questions.py`, `generate_gaps.py`, `run_reviews.sh`,
    `generate.py`, `review_math.sh`) plus WORKFLOW.md from
    `gemini-2.5-flash` or `gemini-2.5-pro` to `gemini-3.1-pro-preview`.
    Only `archive/` references remain (intentionally legacy).

G.B — Cloudflare workflow audit. `vault verify 0.1.1` correctly
    failed (YAMLs evolved since 0.1.1 cut). Confirmed `vault publish`,
    `vault deploy`, `vault ship`, `vault rollback`, `vault verify`,
    `vault snapshot`, `vault tag` all wired. Released 0.1.2 then 0.1.3
    to lock final state.

G.C — Visual asset integrity audit. 236/236 YAML visual references
    resolve, 0 orphan SVGs, 0 missing files, 0 unrendered sources.
    Clean.

G.D — Unit tests for new validators added at `tests/test_models.py`:
    15 tests covering Visual.kind enum, Visual.path regex, Visual.alt
    + caption min lengths + required, Question._zone_bloom_compatible
    (recall+remember accepted, recall+evaluate rejected, mastery+
    remember rejected, evaluation+evaluate accepted, design+create
    accepted), Question._visual_path_resolves. **15/15 pass.**

═══ CONTENT CLEANUP (G.E–G.L) ═══

G.E — Sample re-judge of 100 random cloud parallelism items via
    Gemini 3.1 Pro Preview (4 API calls): 53% PASS / 23% NEEDS_FIX /
    24% DROP. Surfaced legacy quality drift — items generated under
    pre-Phase-D laxer prompts were not meeting the new strict bar
    (math errors with bidirectional vs unidirectional NVLink,
    "Based on the diagram..." references with no diagram, deprecated
    practices like SSP for modern LLM training, wrong-track scenarios
    like Cortex-M4 in cloud track).

G.H — General-purpose cleanup agent on 47 flagged items:
    **31 rewritten** with PARALLELISM_RULES bar applied (concrete
    unidirectional NVLink 450 GB/s, IB NDR 25 GB/s, RoCE v2 22 GB/s,
    PCIe Gen3 12 GB/s; multi-step ring AllReduce arguments with the
    2(N-1)/N factor; non-obvious failure modes); **16 archived** with
    documented `deletion_reason` (mathematically broken premises,
    physics errors, topic-irreconcilable, direct duplicates).

G.L — Re-judge of 31 G.H rewrites: **23 PASS / 3 NEEDS_FIX / 5 DROP =
    74.2% pass rate**. The 8 still-failing items were archived; even
    after the cleanup pass they could not satisfy the strict bar.
    Contract: items get THREE chances (original generation, fix-agent,
    retry-fix); if they still fail, they are archived, not promoted.
    Honest.

═══ STUBBORN-FAIL ARCHIVES (Phase F residuals) ═══

After three independent fix-agent passes (Phase C, F.2, F.4), 4 items
remained NEEDS_FIX or DROP: edge-2390, edge-2401, mobile-1948,
tinyml-1681. Archived with `deletion_reason` documenting the 3-attempt
failure history. The cell may be structurally awkward; the items are
preserved for audit but removed from the bundle.

═══ ORPHAN CHAIN FIX ═══

After the archives, `cloud-chain-359` had only 1 published member
(`cloud-1840`); its sibling `cloud-1845` was archived. Dropped the
chain ref from cloud-1840 and ran `repair_chains.py` to clean residual
references in archived YAMLs. `vault check --strict` now passes with
0 chain warnings.

═══ E.2 / E.3 SHIPPED IN PRIOR COMMIT ═══

(Documented in commit `20ea20005` for completeness):
- `vault build --legacy-json` auto-emits `vault-manifest.json`.
- `analyze_coverage_gaps.py --include-areas <areas>` flag.

═══ 0.1.3 FINAL RELEASE ═══

`vault publish 0.1.3` snapshot at `releases/0.1.3/`. Migrations:
+0 ~27 -28 (zero net new questions, 27 modified during cleanup, 28
archived/promoted). `vault verify 0.1.3` ✓ — release_hash
`793c06f414f2bf8391a8a5c56ec0ff8d76bfce4ab7c64ad12ecb83f6d932280e`
reconstructs from YAML. Latest symlink → 0.1.3.

═══ FINAL ALL-9-GATES SWEEP — ALL GREEN ═══

[1] vault check --strict          ✓ 10,701 / 0 errors / 0 invariants
[2] vault lint                    ✓ 0 errors / 0 warnings / 9,757 info
[3] vault doctor                  ✓ 0 fails (registry-history info OK)
[4] vault codegen --check         ✓ artifacts in sync
[5] vault verify 0.1.3            ✓ hash reconstructs from YAML
[6] staffml validate-vault        ✓ 0 errors / 0 warnings, deployment-ready
[7] render_visuals                ✓ 236 visuals, 0 errors
[8] tsc                           ✓ TypeScript clean
[9] Playwright                    ✓ 9/9 pass

═══ FINAL CORPUS STATE ═══

Bundle: 9,757 published (was 9,224 at branch cut, **+533 net** across
the full multi-session push, after all archives).

Total commits on branch since cut: 10.
Release tag latest: 0.1.3 (verified-clean).
Status: StaffML-day-ready. Ship it.
2026-04-25 19:45:32 -04:00


StaffML Iterative QA Workflow

A standardized, reproducible pipeline for maintaining question quality.

The Loop

┌─────────────────────────────────────────────────────────┐
│                                                         │
│   MEASURE → DIAGNOSE → FIX → VALIDATE → MEASURE        │
│       ↑                                         │       │
│       └─────────────────────────────────────────┘       │
│                                                         │
│   Exit when: OK% > 95%, WARNs < 5%, ERRORs = 0         │
│                                                         │
└─────────────────────────────────────────────────────────┘

Step 1: MEASURE (scorecard.py)

python3 staffml/vault/scripts/scorecard.py

Computes:

  • OK / WARN / ERROR / Pending counts and percentages
  • 6-axis classification completeness
  • Bloom balance deviation per track
  • Reasoning mode CV per track
  • Chain coverage per track
  • Short solutions / napkin math counts
  • Taxonomy health (orphans, missing refs, overloaded concepts)
  • Duplicate candidates (scenario similarity > 0.85)

Output: _validation_results/scorecard_YYYYMMDD.json
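
A minimal sketch of consuming this output and testing the loop's exit criteria, assuming the JSON carries top-level `counts` and `percentages` maps (the actual scorecard schema may differ):

import glob
import json

# Load the most recent scorecard and test the exit criteria from the loop diagram.
latest = sorted(glob.glob("_validation_results/scorecard_*.json"))[-1]
with open(latest) as f:
    card = json.load(f)

ok_pct = card["percentages"]["OK"]      # assumed key layout
warn_pct = card["percentages"]["WARN"]
errors = card["counts"]["ERROR"]

done = ok_pct > 95 and warn_pct < 5 and errors == 0
print(f"{latest}: OK {ok_pct:.1f}%  WARN {warn_pct:.1f}%  ERROR {errors}  exit={done}")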

Step 2: DIAGNOSE (auto)

From the scorecard, automatically prioritize (a sketch of this triage follows the list):

  1. ERRORs → must fix (highest priority)
  2. Short napkin math (<100 chars) → incomplete stubs
  3. Missing taxonomy axes → 6-axis classification gaps
  4. Bloom balance gaps > 5% → generation targets
  5. Chain coverage < 50% → chain building needed
  6. Duplicates → dedup
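
A minimal sketch of that triage order, assuming the scorecard exposes the counts and per-track gap maps used below (field names are illustrative, not the script's real schema):

def prioritize(card: dict) -> list[str]:
    # Walk the priority order top-down; earlier entries get fixed first.
    issues = []
    if card["counts"]["ERROR"] > 0:
        issues.append("fix_errors")                 # 1. must fix
    if card["counts"]["short_napkin_math"] > 0:
        issues.append("complete_napkin_math")       # 2. incomplete stubs
    if card["counts"]["missing_axes"] > 0:
        issues.append("fill_taxonomy_axes")         # 3. 6-axis gaps
    if any(abs(d) > 5 for d in card["bloom_deviation_pct"].values()):
        issues.append("generate_for_bloom_gaps")    # 4. generation targets
    if any(c < 50 for c in card["chain_coverage_pct"].values()):
        issues.append("build_chains")               # 5. chain building
    if card["counts"]["duplicate_candidates"] > 0:
        issues.append("dedup")                      # 6. dedup
    return issues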

Step 3: FIX (parallel agents)

Three fix modes, chosen automatically based on diagnosis:

Mode A: Math/Content Fix (for ERRORs and WARNs)

python3 staffml/vault/scripts/fix_warns.py --model gemini-3.1-pro-preview --workers 8 --batch-size 25
# Fallback: --model opus if Gemini quota hit
  • Reads questions with WARN/ERROR status
  • Sends to LLM with hardware reference sheet
  • LLM diagnoses issue and returns corrected fields
  • Applies fixes, sets status to OK

Mode B: Generation (for Bloom/mode gaps)

python3 staffml/vault/scripts/generate_gaps.py --model gemini-3.1-pro-preview --workers 6
  • Reads scorecard gap analysis
  • Generates targeted questions for worst imbalances
  • Validates immediately in a second pass
  • Adds clean questions to corpus

Mode C: Chain Building (for coverage gaps)

python3 staffml/vault/scripts/build_chains.py --model gemini-3.1-pro-preview --track edge
  • Finds unchained questions with 3+ Bloom levels per topic
  • LLM selects best question at each level
  • Outputs chain JSON, backpopulates corpus

Step 4: VALIDATE (separate pass)

python3 staffml/vault/scripts/gemini_math_review.py --batch-size 40 --workers 8
  • Reviews ALL questions (or just newly fixed ones with --status OK --since-date)
  • Chunks by track × topic for topical coherence
  • Hardware reference sheet from mlsysim/core/constants.py
  • Outputs OK/ERROR/WARN per question
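
The track × topic chunking can be pictured as below; the `track`/`topic` field names are assumptions about the question schema, not the reviewer's actual internals:

from collections import defaultdict
from itertools import islice

def chunk_by_track_topic(questions, batch_size=40):
    # Group questions so each review batch stays topically coherent.
    groups = defaultdict(list)
    for q in questions:
        groups[(q["track"], q["topic"])].append(q)
    for (track, topic), items in groups.items():
        it = iter(items)
        while batch := list(islice(it, batch_size)):
            yield track, topic, batch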

Step 5: MEASURE again → compare to previous scorecard

python3 staffml/vault/scripts/scorecard.py --compare _validation_results/scorecard_previous.json
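
Conceptually the comparison is just a diff of the headline counts between two scorecards; a sketch, with the same assumed key layout as in Step 1:

import json

def compare_scorecards(prev_path: str, curr_path: str) -> None:
    with open(prev_path) as f:
        prev = json.load(f)
    with open(curr_path) as f:
        curr = json.load(f)
    for status in ("OK", "WARN", "ERROR", "Pending"):
        before, after = prev["counts"][status], curr["counts"][status]
        print(f"{status:8s} {before:6d} -> {after:6d} ({after - before:+d})")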

Automation: One Command

python3 staffml/vault/scripts/qa_loop.py --rounds 3 --target-ok 95

This script:

  1. Runs scorecard
  2. Diagnoses top issues
  3. Launches appropriate fix mode (A/B/C) with available model
  4. Validates fixes
  5. Re-runs scorecard
  6. If OK% < target, loops back to step 2
  7. Commits when target reached or rounds exhausted
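
A sketch of that control flow, driving the documented scripts via subprocess; the exit test reuses the scorecard-reading assumptions from Step 1, and diagnosis/mode selection is collapsed to Mode A for brevity:

import glob
import json
import subprocess

SCRIPTS = "staffml/vault/scripts"

def latest_ok_pct() -> float:
    path = sorted(glob.glob("_validation_results/scorecard_*.json"))[-1]
    with open(path) as f:
        return json.load(f)["percentages"]["OK"]   # assumed key layout

def qa_loop(rounds: int = 3, target_ok: float = 95.0) -> None:
    for _ in range(rounds):
        subprocess.run(["python3", f"{SCRIPTS}/scorecard.py"], check=True)           # 1-2. measure + diagnose
        subprocess.run(["python3", f"{SCRIPTS}/fix_warns.py",                        # 3. fix (Mode A shown)
                        "--model", "gemini-3.1-pro-preview"], check=True)
        subprocess.run(["python3", f"{SCRIPTS}/gemini_math_review.py"], check=True)  # 4. validate
        subprocess.run(["python3", f"{SCRIPTS}/scorecard.py"], check=True)           # 5. re-measure
        if latest_ok_pct() >= target_ok:                                             # 6. exit test
            break
    # 7. commit is handled by the Commit Protocol below.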

Model Selection

Model                   Best for                    Rate limit      Cost
gemini-3.1-pro          Math review, generation     250/day         Free tier
gemini-3.1-pro-preview  Bulk validation             1500/day        Free tier
claude-opus-4.6         Deep review, complex fixes  No daily limit  API cost

Strategy: use gemini-3.1-pro for math review and generation, gemini-3.1-pro-preview for bulk validation sweeps, and Opus for deep or complex fixes.

Hardware Reference Source of Truth

All prompts include specs from mlsysim/core/constants.py:

  • GPU: A100/H100/V100/T4 memory, bandwidth, TFLOPS
  • Interconnect: NVLink, PCIe, InfiniBand
  • Energy: Horowitz 2014 values
  • Models: GPT-3, LLaMA-2/3, BERT params/layers/dims
  • Edge: Jetson Orin, Coral, Hailo-8
  • TinyML: Cortex-M4/M7, ESP32, STM32, nRF52840

Never hardcode specs in prompts — always derive from constants.py.
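
A minimal sketch of assembling the reference sheet from that single source of truth; the `GPUS`/`INTERCONNECTS` dict names are assumptions about how constants.py is organized:

from mlsysim.core import constants

def hardware_reference_sheet() -> str:
    # Render the specs into a prompt-ready block instead of hardcoding numbers.
    lines = ["Hardware reference (derived from mlsysim/core/constants.py):"]
    for name, spec in constants.GPUS.items():           # assumed attribute
        lines.append(f"  GPU {name}: {spec}")
    for name, spec in constants.INTERCONNECTS.items():  # assumed attribute
        lines.append(f"  Interconnect {name}: {spec}")
    return "\n".join(lines)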

Step 6: INVARIANT GATE (before commit)

python3 staffml/vault/scripts/vault_invariants.py

This runs 14 structural checks across corpus, taxonomy, and chains:

  • No duplicate concept names or IDs
  • All IDs are kebab-case (no Title Case stubs from LLM extraction)
  • question_count matches actual corpus counts (auto-fixable with --fix)
  • All competency_area and level values are canonical
  • No orphan prerequisites or graph cycles
  • All chain question IDs exist in corpus
  • No duplicate question IDs

The commit MUST NOT proceed if any check is FAIL. Run --fix first for auto-fixable issues (checks 4 and 9).

For pipeline scripts, use the gate module:

from gate import InvariantGate

with InvariantGate():
    ...  # modify corpus/taxonomy here
# on exit, the gate re-runs the checks and blocks if new FAILs appear
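
One way such a gate could work (illustrative only; the real gate module may differ): snapshot the failing checks on entry, re-run them on exit, and raise if new FAILs appeared.

import subprocess

def failing_checks() -> set[str]:
    # Illustrative: run the invariant script and collect lines reporting FAIL.
    out = subprocess.run(
        ["python3", "staffml/vault/scripts/vault_invariants.py"],
        capture_output=True, text=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if "FAIL" in line}

class InvariantGate:
    def __enter__(self):
        self.before = failing_checks()
        return self

    def __exit__(self, exc_type, exc, tb):
        new_fails = failing_checks() - self.before
        if exc_type is None and new_fails:
            raise RuntimeError(f"Invariant gate blocked: {sorted(new_fails)}")
        return False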

Commit Protocol

After each successful loop iteration:

# Verify invariants pass
python3 staffml/vault/scripts/vault_invariants.py
# Then commit
git add corpus.json chains.json staffml/src/data/corpus.json taxonomy.json
git commit -m "staffml: QA loop round N — OK X% → Y%, N fixes applied"
git push origin dev