3 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
3f0773706f chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation references
The Phase 0 cleanup removed 18 scripts as deprecated, but 6 of them
implement capabilities the modern tooling does not yet cover. This commit
restores them as reference patterns, not active scripts.

What's restored and why:

  gemini_backfill_question.py
    Idempotent corpus-walk + Gemini batch + thread-pool + JSON/YAML
    round-trip. The "fix one field across thousands of YAMLs" pattern.
    To be mined in CORPUS_HARDENING_PLAN.md Phase 5.

  gpt_backfill_question.py
    OpenAI variant of the above. Cross-provider template.

  gemini_cli_generate_questions.py (35K)
    BATCHED generation: 12 cells per call with balanced track × area ×
    zone × level round-robin. `vault generate` does NOT batch — it calls
    once per question. This script's batching pattern is what we want
    when generating > 100 questions in bulk.

  generate.py (30K)
    Coverage-survey-driven generation engine: surveys the corpus, finds
    empty cells, generates to fill the emptiest first, stops when
    saturated. `vault generate` lacks this auto-balance loop.

  gemini_fix_errors.py
    Batch error-fixer with hardware-reference grounding (V100 / A100 /
    H100 / B200 / T4 specs as ground-truth context). To be mined for
    audit_corpus_batched.py --propose-fixes in Phase 5.

  deep_verify.py
    Claude Opus + extended thinking; SHOWS ITS WORK on every napkin-math
    claim. Useful as a tiebreaker on borderline math findings from the
    lightweight audit.
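
Two of the patterns above, the 12-cells-per-call batching in
gemini_cli_generate_questions.py and the emptiest-first fill loop in
generate.py, can be sketched together as a planning step. This is a
minimal illustration, not code from either script: the function name
`plan_batches`, the `target` parameter, and the representation of a cell
as a (track, area, zone, level) tuple are all assumptions.

```python
from collections import Counter
from itertools import product

BATCH_SIZE = 12  # cells per generation call, as described above


def plan_batches(existing, axes, target, batch_size=BATCH_SIZE):
    """Return batches of cells to generate, emptiest cells first.

    existing: iterable of (track, area, zone, level) tuples already in the corpus
    axes: four lists (tracks, areas, zones, levels) defining the full grid
    target: questions wanted per cell; a cell at or above it is saturated
    """
    counts = Counter(existing)
    # Survey: cover every cell in the grid, including cells with zero questions.
    deficits = {cell: target - counts[cell] for cell in product(*axes)}
    queue = []
    # Round-robin: each pass adds one slot per unsaturated cell, largest
    # deficit first, and the loop stops once every cell is saturated.
    while any(d > 0 for d in deficits.values()):
        open_cells = sorted(
            (c for c, d in deficits.items() if d > 0),
            key=lambda c: (-deficits[c], c),
        )
        for cell in open_cells:
            queue.append(cell)
            deficits[cell] -= 1
    # Chunk the ordered queue so each model call covers batch_size cells.
    return [queue[i:i + batch_size] for i in range(0, len(queue), batch_size)]
```

Each returned batch would map to one model call. An empty result means
the grid is already saturated at the chosen target.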

Each restored file has a 5-line STATUS comment block at the top
documenting what to adapt before running. DEPRECATED.md is restructured
to make the three categories explicit (removed / preserved-for-adaptation
/ active-migration) and now includes an adaptation checklist that applies
to all preserved scripts (replace corpus.json loading, verify SDK pins,
update output paths, re-validate prompts, sample first).

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest — 74/74
  ruff — clean
2026-05-03 07:50:28 -04:00
Vijay Janapa Reddi
56d3ed1551 chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0
All 18 scripts pre-date the YAML-as-source-of-truth migration
(ARCHITECTURE.md v2.x, Phase 1) and are listed in DEPRECATED.md's
replaced-by table. The corpus.json they ran against is itself now a
build artifact (gitignored, regenerated by `vault build --local-json`).

Removed top-level (13):
  build_corpus.py        → vault build (walks YAML, emits vault.db)
  export_to_staffml.py   → vault build --local-json
  extract_taxonomy.py    → vault/taxonomy.yaml
  deep_verify.py         → audit_chains_with_gemini.py + validate_drafts.py
  gemini_*.py × 6        → Phase-7 vault generate / batched audit pipeline
  gpt_backfill_question.py
  gate.py                → obsolete after schema v1.0
  generate.py            → vault generate

Removed archive/ (5):
  expand_tracks.py, fill_zone_gaps.py, fill_gaps.sh, final_balance.sh,
  README.md (now-orphan).

DEPRECATED.md updated: replaced-by table reorganized as a removal log
for git-archaeology, with a note that historical implementations are
findable via `git log --diff-filter=D`.

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest interviews/vault-cli/tests/ — 74/74
  ruff check interviews/vault-cli — clean

This is Phase 0 of CORPUS_HARDENING_PLAN.md.
2026-05-03 07:44:13 -04:00
Vijay Janapa Reddi
9955a76b92 feat(staffml): deep verification + mock NeurIPS reviews + paper improvements
Deep verification: 237-question stratified sample, 4.2% error rate found.
All 10 errors fixed (unit confusion, arithmetic, conceptual misapplication).
96 physics violations removed (impossible topic×track pairs).
Extended invariant checks added (applicability matrix enforcement).
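
An applicability-matrix invariant of the kind described above could look
roughly like the sketch below. It is illustrative only: the function
name, the question schema (`id`, `topic`, `track` keys), and the shape of
the matrix are assumptions, not the actual vault check.

```python
def find_applicability_violations(questions, applicable):
    """Return (id, topic, track) for every question the matrix does not allow.

    questions: iterable of dicts with "id", "topic", and "track" keys
               (assumed schema)
    applicable: dict mapping topic -> set of tracks where the topic applies
    """
    violations = []
    for q in questions:
        # A topic missing from the matrix is treated as allowed nowhere.
        allowed = applicable.get(q["topic"], set())
        if q["track"] not in allowed:
            violations.append((q["id"], q["topic"], q["track"]))
    return violations
```

A strict check would fail whenever this list is non-empty, which is one
way impossible topic×track pairs can be kept out of the corpus.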

Paper improvements from mock NeurIPS review feedback:
- Bloom critique softened ("complements" not "departs from")
- LLM generation transparency (95% ratio + 4.2% error rate disclosed)
- Scope explicitly limited to technical systems reasoning
- H100 specs corrected (989 TFLOPS, not 495)
- Track percentages now reference the table instead of being hardcoded
- Figure captions use macros for consistency

New topics with questions: software-portability (50), comm-compute-overlap (50).
Phase metadata reclassified (42.5% inference, 37.7% both, 19.9% training).
2026-04-02 07:28:41 -04:00