cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-06 17:49:07 -05:00

Files

Vijay Janapa Reddi e43ff34719 feat(vault-cli): chain audit + rescue suggestions with embedding similarity

Adds two subcommands and supporting modules:

  vault chains audit
    Reports chain health: orphans, position-drift (gaps from filtered
    members), stale-registry, intra-chain cosine distribution, weakest
    chains list. Embedding-aware via --no-embeddings escape hatch.

  vault chains suggest
    For each orphan singleton, ranks rescue candidates within the same
    (track, topic) bucket. Hybrid scoring:
      HARD filter: level_delta in {0, 1, 2} (matches 92% of observed
                   chain edges across the corpus)
      SOFT rank:   embedding cosine + delta=1 priority
      Bands:       strong-merge / review-merge / below-threshold

Embeddings: bge-small-en-v1.5 (BAAI). Calibrated via
scripts/calibrate_chain_embeddings.py against the 726 healthy chains.
Empirical findings (in script header docstring):
  - bge-small precision@1 = 0.283, recall@3 = 0.447
  - bge-large gains only +0.013 P@1 at 7x embedding time — not worth it
  - Same-bucket questions are inherently close (μ_pos=0.785, μ_neg=0.757);
    so this is suggestion-only, never auto-apply.

Cross-encoder rerank experiment script included for future research
(BAAI/bge-reranker-base) — current run OOM'd on 16GB; deferred.

Embedding cache (.npz) is gitignored — reproducible from source.

2026-04-29 19:00:09 -04:00