3 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
3f0773706f chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation references
The Phase 0 cleanup removed 18 scripts as deprecated, but 6 of them have
unique-capability patterns not yet covered by the modern tooling. Restoring
them as reference patterns, not active scripts.

What's restored and why:

  gemini_backfill_question.py
    Idempotent corpus-walk + Gemini batch + thread-pool + JSON/YAML
    round-trip. The "fix one field across thousands of YAMLs" pattern.
    To be mined in CORPUS_HARDENING_PLAN.md Phase 5.

  gpt_backfill_question.py
    OpenAI variant of the above. Cross-provider template.

  gemini_cli_generate_questions.py (35K)
    BATCHED generation: 12 cells per call with balanced track × area ×
    zone × level round-robin. `vault generate` does NOT batch: it makes
    one call per question. This script's batching pattern is what we want
    when generating more than 100 questions in bulk.

  generate.py (30K)
    Coverage-survey-driven generation engine: surveys the corpus, finds
    empty cells, generates to fill the emptiest first, stops when
    saturated. `vault generate` lacks this auto-balance loop.

  gemini_fix_errors.py
    Batch error-fixer with hardware-reference grounding (V100 / A100 /
    H100 / B200 / T4 specs as ground-truth context). To be mined for
    audit_corpus_batched.py --propose-fixes in Phase 5.

  deep_verify.py
    Claude Opus + extended thinking; SHOWS ITS WORK on every napkin-math
    claim. Useful as a tiebreaker on borderline math findings from the
    lightweight audit.
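
The two generation patterns described above (emptiest-first cell selection
from generate.py, and 12-cell round-robin batching from
gemini_cli_generate_questions.py) can be sketched roughly as follows. This
is a minimal illustration, not the restored scripts' code: the axis values,
the function names, and the `target` parameter are all hypothetical, and
the round-robin here interleaves by track only.

```python
from collections import Counter
from itertools import islice, product, zip_longest

# Hypothetical axis values; the real ones would come from the taxonomy.
TRACKS = ["cloud", "edge"]
AREAS = ["compute", "memory"]
ZONES = ["recall", "design"]
LEVELS = ["L1", "L2", "L3"]

BATCH_SIZE = 12  # cells per model call, as stated in the commit message


def all_cells():
    """Every (track, area, zone, level) combination."""
    return list(product(TRACKS, AREAS, ZONES, LEVELS))


def emptiest_first(existing, cells, target):
    """Cells holding fewer than `target` questions, emptiest first.

    `existing` has one cell tuple per question already in the corpus.
    An empty result means every cell is saturated, so generation stops.
    """
    counts = Counter(existing)
    short = [c for c in cells if counts[c] < target]
    return sorted(short, key=lambda c: counts[c])  # stable sort


def round_robin_by_track(cells):
    """Interleave cells so consecutive entries rotate across tracks."""
    by_track = {}
    for cell in cells:
        by_track.setdefault(cell[0], []).append(cell)
    ordered = []
    for group in zip_longest(*by_track.values()):
        ordered.extend(c for c in group if c is not None)
    return ordered


def batches(cells, size=BATCH_SIZE):
    """Yield lists of up to `size` cells: one list per model call."""
    it = iter(cells)
    while chunk := list(islice(it, size)):
        yield chunk
```

A driver loop would call emptiest_first() after each round, pass the result
through round_robin_by_track() and batches(), and issue one model call per
batch until emptiest_first() returns an empty list.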

Each restored file has a 5-line STATUS comment block at the top
documenting what to adapt before running. DEPRECATED.md is restructured
to make the three categories explicit (removed / preserved-for-adaptation
/ active-migration), and adds an adaptation checklist that applies to
all preserved scripts (replace corpus.json loading, verify SDK pins,
update output paths, re-validate prompts, sample first).

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest — 74/74
  ruff — clean
2026-05-03 07:50:28 -04:00
Vijay Janapa Reddi
56d3ed1551 chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0
All 18 scripts pre-date the YAML-as-source-of-truth migration
(ARCHITECTURE.md v2.x, Phase 1) and are listed in DEPRECATED.md's
replaced-by table. The corpus.json they ran against is itself now a
build artifact (gitignored, regenerated by `vault build --local-json`).

Removed top-level (13):
  build_corpus.py        → vault build (walks YAML, emits vault.db)
  export_to_staffml.py   → vault build --local-json
  extract_taxonomy.py    → vault/taxonomy.yaml
  deep_verify.py         → audit_chains_with_gemini.py + validate_drafts.py
  gemini_*.py × 6,
  gpt_backfill_question.py
                         → Phase-7 vault generate / batched audit pipeline
  gate.py                → obsolete after schema v1.0
  generate.py            → vault generate

Removed archive/ (5):
  expand_tracks.py, fill_zone_gaps.py, fill_gaps.sh, final_balance.sh,
  README.md (now orphaned).

DEPRECATED.md updated: replaced-by table reorganized as a removal log
for git-archaeology, with a note that historical implementations are
findable via `git log --diff-filter=D`.

Validation:
  vault check --strict — 10,711 loaded, 0 invariant failures
  pytest interviews/vault-cli/tests/ — 74/74
  ruff check interviews/vault-cli — clean

This is Phase 0 of CORPUS_HARDENING_PLAN.md.
2026-05-03 07:44:13 -04:00
Vijay Janapa Reddi
26e0ab3856 restructure interviews/ with vault separation and per-directory licenses
- Move corpus, taxonomy, chains, scripts into interviews/vault/
- Rename interviews/staffml/ (was interviews/staffml/) as the branded app
- Add CC BY-NC-SA 4.0 LICENSE to: book, kits, labs, slides, instructors, interviews
- Add AGPL-3.0 LICENSE to interviews/staffml/ (the app)
- Add vault LICENSE for pipeline scripts
- Update all GitHub Actions workflows for new paths
- Update README links and vault.yaml export paths
- Fix regex patterns in site/book deploy workflows

License structure:
  interviews/LICENSE      — CC BY-NC-SA 4.0 (corpus + data)
  interviews/staffml/LICENSE — AGPL-3.0 (app code)
  interviews/vault/LICENSE   — pipeline copyright
  book|kits|labs|slides|instructors/LICENSE — CC BY-NC-SA 4.0
  tinytorch/LICENSE       — Apache 2.0 (unchanged)
2026-03-25 15:18:14 -04:00