mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-06 17:49:07 -05:00
Adds two subcommands and supporting modules:
vault chains audit
Reports chain health: orphans, position-drift (gaps from filtered
members), stale-registry, intra-chain cosine distribution, weakest
chains list. Embedding-aware, with a --no-embeddings escape hatch.
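
The position-drift check described above might be sketched as follows. This is a minimal illustration, not the audit code from this commit: it assumes each chain member carries an integer position and that a healthy chain has contiguous positions. The function name `position_gaps` is hypothetical.

```python
def position_gaps(positions):
    """Return the positions missing between the smallest and largest present.

    Assumption (not confirmed by the commit): positions are integers and a
    healthy chain is contiguous, so any missing value in the range indicates
    a member that was filtered out.
    """
    present = set(positions)
    if not present:
        return []
    return [p for p in range(min(present), max(present) + 1) if p not in present]
```

For example, a chain whose surviving members sit at positions 1, 2, 4, 5 would report a gap at 3.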
vault chains suggest
For each orphan singleton, ranks rescue candidates within the same
(track, topic) bucket. Hybrid scoring:
HARD filter: level_delta in {0, 1, 2} (matches 92% of observed
chain edges across the corpus)
SOFT rank: embedding cosine + delta=1 priority
Bands: strong-merge / review-merge / below-threshold
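
A rough sketch of how the hard filter, soft rank, and bands could fit together. The band thresholds and the size of the delta=1 bonus are not stated in this commit, so the constants below are placeholders, and the `Candidate` type and function names are hypothetical.

```python
from dataclasses import dataclass

# Placeholder values: the commit does not give the real thresholds or bonus.
STRONG_MERGE = 0.80
REVIEW_MERGE = 0.70
DELTA_ONE_BONUS = 0.02
ALLOWED_DELTAS = {0, 1, 2}  # HARD filter from the commit message


@dataclass
class Candidate:
    chain_id: str      # hypothetical identifier of the candidate chain
    level_delta: int   # level difference between orphan and candidate
    cosine: float      # embedding cosine similarity to the orphan


def band(score):
    """Map a score to one of the three bands named in the commit."""
    if score >= STRONG_MERGE:
        return "strong-merge"
    if score >= REVIEW_MERGE:
        return "review-merge"
    return "below-threshold"


def rank_candidates(candidates):
    """Drop candidates that fail the hard filter, then rank the rest."""
    kept = [c for c in candidates if c.level_delta in ALLOWED_DELTAS]
    scored = []
    for c in kept:
        # SOFT rank: cosine plus a small preference for delta == 1.
        score = c.cosine + (DELTA_ONE_BONUS if c.level_delta == 1 else 0.0)
        scored.append((c.chain_id, score, band(score)))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The filter runs before any scoring, so a candidate with a very high cosine but level_delta outside {0, 1, 2} is never suggested.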
Embeddings: bge-small-en-v1.5 (BAAI). Calibrated via
scripts/calibrate_chain_embeddings.py against the 726 healthy chains.
Empirical findings (in script header docstring):
- bge-small precision@1 = 0.283, recall@3 = 0.447
- bge-large gains only +0.013 P@1 at 7x the embedding time, so it is not worth it
- Same-bucket questions are inherently close (μ_pos=0.785, μ_neg=0.757,
a gap of only 0.028), so this is suggestion-only, never auto-apply.
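
For reference, precision@1 and recall@3 under their usual definitions can be computed as in the sketch below. The exact definitions used in scripts/calibrate_chain_embeddings.py are not shown in this commit, so treat this as an assumption; the function names and data layout are hypothetical.

```python
def precision_at_1(ranked, relevant):
    """Fraction of queries whose top-ranked candidate is relevant.

    ranked:   dict mapping query id -> list of candidate ids, best first
    relevant: dict mapping query id -> non-empty set of correct candidate ids
    """
    hits = sum(1 for q, r in ranked.items() if r and r[0] in relevant[q])
    return hits / len(ranked)


def recall_at_k(ranked, relevant, k=3):
    """Mean over queries of the share of relevant ids found in the top k."""
    total = 0.0
    for q, r in ranked.items():
        rel = relevant[q]
        total += len(rel.intersection(r[:k])) / len(rel)
    return total / len(ranked)
```

With two queries where only one has a correct top result, `precision_at_1` returns 0.5.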
A cross-encoder rerank experiment script (BAAI/bge-reranker-base) is
included for future research; the current run OOM'd on 16GB, so it is deferred.
Embedding cache (.npz) is gitignored — reproducible from source.