mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-07 02:03:55 -05:00
Detects chain members that have drifted semantically away from their
chain mates after an edit. Re-embeds changed YAMLs with the same model
the corpus uses (BAAI/bge-small-en-v1.5) and reports, for each chain,
the minimum cosine similarity between the changed question and its
chain mates.
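As an illustration of the reported metric, here is a minimal sketch of a minimum mate-cosine computation. The function name `min_mate_cosine` and its inputs are hypothetical and not taken from scripts/check_chain_decay.py; the sketch assumes the changed question and its mates are already available as plain vectors.

```python
import numpy as np


def min_mate_cosine(changed_vec, mate_vecs):
    """Return the smallest cosine similarity between one vector and its mates.

    changed_vec: 1-D embedding of the re-embedded (changed) question.
    mate_vecs:   2-D array, one row per chain mate.
    Both names are illustrative assumptions, not the script's actual API.
    """
    v = np.asarray(changed_vec, dtype=float)
    m = np.asarray(mate_vecs, dtype=float)
    # Normalize so that the dot product equals the cosine similarity.
    v = v / np.linalg.norm(v)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    return float(np.min(m @ v))
```

A chain would be flagged when this value falls below the threshold (0.40 by default).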
Default invocation (advisory):
    python3 scripts/check_chain_decay.py
    # diffs against origin/dev, flags chains with min mate-cosine < 0.40
Other modes:
    --files <a.yaml> <b.yaml>   explicit files instead of git diff
    --base HEAD~5               different base ref
    --threshold 0.50            tighter cutoff (slow drift detection)
    --strict                    exit non-zero on flag (use as CI gate)
The default is advisory, not blocking: the first release intentionally
doesn't fail commits or CI. The 0.40 threshold is calibrated against the
post-Phase-1 corpus; tune it once you've seen what real-edit deltas
look like in practice.
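The difference between the advisory default and --strict can be sketched as follows. This is an assumption about the general shape of such a gate, not the script's actual code; `exit_code` and the `min_cosines` mapping are hypothetical names.

```python
def exit_code(min_cosines, threshold=0.40, strict=False):
    """Report flagged chains and pick a process exit code.

    min_cosines: dict of chain id -> minimum mate-cosine (hypothetical input).
    Advisory mode always returns 0; strict mode returns 1 if anything is flagged.
    """
    flagged = {cid: cos for cid, cos in min_cosines.items() if cos < threshold}
    for cid, cos in sorted(flagged.items()):
        print(f"chain {cid}: min mate-cosine {cos:.2f} < {threshold:.2f}")
    return 1 if (strict and flagged) else 0
```

In advisory mode the flags are printed but the return value stays 0, so neither commits nor CI are blocked.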
Implementation notes:
- Reuses embeddings.npz for chain-mate vectors (no re-embedding the
  whole corpus per run).
- Only the changed question gets re-embedded, which is fast for typical
  PR-sized changes.
- Skips changed questions that aren't in chains; skips chain
  memberships where the mate isn't in embeddings.npz (e.g., the
  Phase 3 promoted drafts before they hit the next embedding rebuild).
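The two skip rules above can be sketched like this. The data layouts are assumptions made for illustration: `chains` stands in for chains.json as a mapping of chain id to member ids, and `vectors` stands in for the contents of embeddings.npz as a mapping of question id to embedding. The real files may be organized differently.

```python
def mates_with_vectors(qid, chains, vectors):
    """Collect cached mate vectors for every chain that contains qid.

    chains:  dict of chain id -> list of question ids (assumed layout).
    vectors: dict of question id -> embedding (assumed layout).
    Returns an empty dict when qid is not in any chain.
    """
    result = {}
    for chain_id, members in chains.items():
        if qid not in members:
            continue  # the changed question is not part of this chain
        # Keep only mates that already have a cached embedding.
        found = {m: vectors[m] for m in members if m != qid and m in vectors}
        if found:
            result[chain_id] = found
    return result
```

Only the changed question itself would then need a fresh embedding; every mate vector comes from the cache.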
Smoke checks:
- --base origin/dev finds 4 changed YAMLs (the Phase 3 promoted
  drafts) and correctly reports no chain memberships (those questions
  aren't in chains.json yet; by design, gated on human review).
- --files <cloud-2520.yaml> on a real chain member: cos=0.79 vs
  its L5 mate cloud-2521 (well above the 0.40 threshold ✓).