cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-08 02:28:25 -05:00

Author	SHA1	Message	Date
Vijay Janapa Reddi	3f0773706f	chore(vault): restore 6 unique-capability scripts as preserved-for-adaptation references The Phase 0 cleanup removed 18 scripts as deprecated, but 6 of them have unique-capability patterns not yet covered by the modern tooling. Restoring them as reference patterns, not active scripts. What's restored and why: gemini_backfill_question.py Idempotent corpus-walk + Gemini batch + thread-pool + JSON YAML round-trip. The "fix one field across thousands of YAMLs" pattern. To be mined in CORPUS_HARDENING_PLAN.md Phase 5. gpt_backfill_question.py OpenAI variant of the above. Cross-provider template. gemini_cli_generate_questions.py (35K) BATCHED generation: 12 cells per call with balanced track × area × zone × level round-robin. `vault generate` does NOT batch — it calls once per question. This script's batching pattern is what we want when generating > 100 questions in bulk. generate.py (30K) Coverage-survey-driven generation engine: surveys the corpus, finds empty cells, generates to fill the emptiest first, stops when saturated. `vault generate` lacks this auto-balance loop. gemini_fix_errors.py Batch error-fixer with hardware-reference grounding (V100 / A100 / H100 / B200 / T4 specs as ground-truth context). To be mined for audit_corpus_batched.py --propose-fixes in Phase 5. deep_verify.py Claude Opus + extended thinking; SHOWS ITS WORK on every napkin-math claim. Useful as a tiebreaker on borderline math findings from the lightweight audit. Each restored file has a 5-line STATUS comment block at the top documenting what to adapt before running. DEPRECATED.md is restructured to make the three categories explicit (removed / preserved-for-adaptation / active-migration), and adds an adaptation checklist that applies to all preserved scripts (replace corpus.json loading, verify SDK pins, update output paths, re-validate prompts, sample first). Validation: vault check --strict — 10,711 loaded, 0 invariant failures pytest — 74/74 ruff — clean	2026-05-03 07:50:28 -04:00
Vijay Janapa Reddi	56d3ed1551	chore(vault): remove 18 deprecated scripts per CORPUS_HARDENING_PLAN.md Phase 0 All 18 scripts pre-date the YAML-as-source-of-truth migration (ARCHITECTURE.md v2.x, Phase 1) and are listed in DEPRECATED.md's replaced-by table. The corpus.json they ran against is itself now a build artifact (gitignored, regenerated by `vault build --local-json`). Removed top-level (13): build_corpus.py → vault build (walks YAML, emits vault.db) export_to_staffml.py → vault build --local-json extract_taxonomy.py → vault/taxonomy.yaml deep_verify.py → audit_chains_with_gemini.py + validate_drafts.py gemini_*.py × 6 → Phase-7 vault generate / batched audit pipeline gpt_backfill_question.py gate.py → obsolete after schema v1.0 generate.py → vault generate Removed archive/ (5): expand_tracks.py, fill_zone_gaps.py, fill_gaps.sh, final_balance.sh, README.md (now-orphan). DEPRECATED.md updated: replaced-by table reorganized as a removal log for git-archaeology, with a note that historical implementations are findable via `git log --diff-filter=D`. Validation: vault check --strict — 10,711 loaded, 0 invariant failures pytest interviews/vault-cli/tests/ — 74/74 ruff check interviews/vault-cli — clean This is Phase 0 of CORPUS_HARDENING_PLAN.md.	2026-05-03 07:44:13 -04:00
Vijay Janapa Reddi	9fdbfb9a4c	refactor(vault-cli): rename --legacy-json to --local-json The flag is the StaffML frontend's local-dev fallback (read corpus.json from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path. "Legacy" implied "soon to be removed"; "local-json" describes its actual role and reads correctly in scripts and docs. - vault-cli: rename CLI flag, parameter, result key, and help text. - CI workflows + pre-commit config: invoke the new flag name. - All scripts that print the command (suggest_exemplars, pre_commit_corpus_guard, promote_validated, rename_legacy_ids, export_to_staffml, the paper analyze_corpus/generate_*) updated. - Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING, MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend comments and .env.example / .gitignore) updated. The "legacy_json" sentinel string in corpus_stats.json._meta.source is intentionally NOT renamed — it is a stable artifact format read by downstream paper-generation tooling.	2026-04-30 09:30:28 -04:00
Vijay Janapa Reddi	cbdb566381	feat(vault): Phase-1 migration contract fully closed in-repo v2.3 \u2192 v2.4. ARCHITECTURE.md header + Appendix reflect the completed migration. WHAT CLOSED (\u00a711.1 contract): 1. `vault build --legacy-json` regenerates the site's interviews/staffml/src/data/corpus.json from YAML. 9,199 published questions, site-compatible shape (chain_positions back to 0-indexed dict form, bloom_level derived from zone, competency_area aliased from topic, scope aliased from track). Deterministic via sort_keys + id-sort. 2. Pre-commit hook INSTALLED via worktree-aware Makefile target (`make -C interviews/vault-cli hooks`). Symlink points at pre_commit_corpus_guard.py. Tested end-to-end: direct edit to vault/corpus.json triggers exit-1 with §11.1 reference. 3. CI equivalence check added to .github/workflows/vault-ci.yml: regenerates corpus.json from YAML, diffs against committed. Fails PR on drift with actionable error message. 4. Legacy generators demoted with DEPRECATED headers: - interviews/paper/scripts/analyze_corpus.py \u2192 vault export-paper - interviews/staffml/scripts/sync-vault.py \u2192 vault build --legacy-json - interviews/staffml/scripts/generate-manifest.py \u2192 vault publish - interviews/vault/scripts/export_to_staffml.py \u2192 vault build --legacy-json 5. New DEPRECATED.md files at interviews/vault/scripts/ and interviews/staffml/scripts/ map every legacy script to its replacement. Both directories keep the old scripts for git-history legibility and archaeology; new contributors see the vault CLI first. 6. ARCHITECTURE.md \u00a7Appendix rewritten as current-state table instead of aspirational "gone. replaced by..." entries. NEW TESTS (interviews/vault-cli/tests/test_legacy_export.py \u2014 +4): - test_legacy_shape_matches_site_interface: every field corpus.ts declares is present in regenerated JSON. - test_chain_positions_legacy_shape: 1-indexed new schema \u2192 0-indexed legacy dict form. - test_emitter_deterministic: byte-stable across reversed input order (required for CI diff-check). - test_competency_area_aliases_topic: legacy alias fields populated correctly. FULL MATRIX GREEN: pytest: 38/38 passed in 0.19s (34 + 4 legacy-export) ruff: All checks passed hook: exit 0 on clean diff / exit 1 on corpus.json direct edit e2e: vault build --legacy-json regenerates a bit-identical corpus.json vs the committed one; CI check wired to catch drift WHAT'S LEFT (deploy-gated, \u00a720.5 #1, #5, #6 partial, #8, #9): - Production serves from D1: requires Phase-3 wrangler d1 create + deploy - Manual QA per CUTOVER_QA.md: requires live staging - Zero data loss D1-side verification: requires live D1 - 48h monitoring: requires production traffic These are intrinsically user-action; the YAML-side migration is done.	2026-04-16 14:57:24 -04:00

4 Commits