cs249r_book/interviews/vault/draft-validation-scorecard.json
Vijay Janapa Reddi 9f83d3e8a6 feat(vault): Phase 3 pilot — 5 gaps generated, 4 promoted as drafts
Pilot run of the Phase 3 authoring tooling on a 5-gap subset (sized
down from the roadmap's 30 to keep wall time and the Gemini-call
budget reasonable for an unsupervised run).

Pilot scope:
  Selected 5 high-value gaps from gaps.proposed.lenient.json — buckets
  with ≥4 published questions, biased toward low-density tracks. All 5
  picks landed in edge/mobile.

Phase 3.c — generate (5/5 written):
  edge-2535  edge/latency-decomposition L?→L3
  edge-2536  edge/pruning-sparsity L?→L4
  edge-2537  edge/tco-cost-modeling L?→L3
  mobile-2146  mobile/duty-cycling L?→L3
  mobile-2147  mobile/model-format-conversion L?→L2

Phase 3.b validation — 4/5 pass (80% — above roadmap's 60-75% target):
  edge-2535: FAIL on originality (cos=0.933 vs edge-1883, threshold 0.92)
  edge-2536: pass on all 4 gates
  edge-2537: pass on all 4 gates
  mobile-2146: pass on all 4 gates
  mobile-2147: pass on all 4 gates

The originality gate correctly caught a draft that was too similar
to one of its bridge anchors, which is exactly the failure mode it
was designed for. Each draft was checked for schema validity
(Pydantic) and then run through the four gates: originality
(BAAI/bge-small-en-v1.5 cosine vs in-bucket neighbours, fail at
cosine >= 0.92), level_fit (Gemini-judge against same-level
exemplars), coherence (Gemini-judge), and bridge (Gemini-judge
against the gap anchors).
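
A minimal sketch of how an originality check of this shape could work,
assuming embeddings are already computed. The function names and the toy
3-d vectors are hypothetical and are not the vault tooling's actual code;
only the fail-at-or-above-threshold rule and the reported fields are taken
from the scorecard.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def originality_gate(candidate, neighbours, threshold=0.92):
    """Compare a candidate embedding against its in-bucket neighbours.

    `neighbours` maps a question id to its embedding (hypothetical layout).
    The draft fails when the closest neighbour is at or above the threshold.
    """
    top_id, top_cos = max(
        ((nid, cosine(candidate, vec)) for nid, vec in neighbours.items()),
        key=lambda pair: pair[1],
    )
    return {
        "originality": "fail" if top_cos >= threshold else "pass",
        "top_neighbour": top_id,
        "cosine": round(top_cos, 4),
        "threshold": threshold,
        "bucket_size": len(neighbours),
    }


# Toy vectors standing in for real sentence embeddings.
neighbours = {"near": [1.0, 0.0, 0.0], "far": [0.0, 1.0, 0.0]}
too_similar = originality_gate([0.9, 0.1, 0.0], neighbours)
distinct = originality_gate([1.0, 1.0, 0.0], neighbours)
```

Here `too_similar` fails against `near`, while `distinct` sits well below
the threshold for both neighbours and passes.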

Phase 3.d — promotion (4 passing drafts):
  - .yaml.draft → .yaml rename
  - _authoring stripped; replaced with proper schema fields:
      provenance: llm-draft
      status: draft  (NOT published — gating on human review)
      authors: [gemini-3.1-pro-preview]
      human_reviewed: { status: not-reviewed }
      tags: + gap-bridge:<from>-<to>
  - id-registry.yaml appended (append-only ledger preserved)
  - edge-2535.yaml.draft kept in place for the human reviewer's
    disposition (rewrite + retry vs delete)
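
The field rewrite in the promotion step can be sketched roughly as below.
This is an illustration only: `promote_draft`, the sample record, and the
`anchor-a`/`anchor-b` placeholders are hypothetical, and the file rename
and id-registry.yaml append are not modelled.

```python
import copy


def promote_draft(draft, bridge_from, bridge_to, model="gemini-3.1-pro-preview"):
    """Return a promoted copy of a draft record (hypothetical helper).

    Strips the `_authoring` scratch block and sets the schema fields
    listed in the promotion step. The input dict is left untouched.
    """
    promoted = copy.deepcopy(draft)
    promoted.pop("_authoring", None)
    promoted["provenance"] = "llm-draft"
    promoted["status"] = "draft"  # not published until human review
    promoted["authors"] = [model]
    promoted["human_reviewed"] = {"status": "not-reviewed"}

    # Add the gap-bridge tag once, keeping any existing tags.
    tags = list(promoted.get("tags", []))
    bridge_tag = f"gap-bridge:{bridge_from}-{bridge_to}"
    if bridge_tag not in tags:
        tags.append(bridge_tag)
    promoted["tags"] = tags
    return promoted


# Made-up record; real drafts carry many more fields.
draft = {"id": "example-0001", "_authoring": {"note": "scratch"}, "tags": ["demo"]}
promoted = promote_draft(draft, "anchor-a", "anchor-b")
```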

Validation post-promotion:
  - vault check --strict: 10,705 loaded (was 10,701; +4 ✓), 0 failures
  - vault build --legacy-json: released set unchanged
    (status=draft excluded by release-policy.yaml's published filter)
    — releaseHash and chainCount intentionally stable until human
    review flips status

Phase 3.e (chain rebuild) deferred: drafts must clear human review
and flip to status: published before they're eligible for chain
membership. Runbook in CHAIN_ROADMAP.md Progress Log.

Cost: 5 generation + 15 judge = 20 Gemini calls.
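
The call count is consistent with one generation call per draft plus one
judge call per draft for each of the three Gemini-judge gates. That
per-gate breakdown is an assumption inferred from the totals, not something
the tooling reports:

```python
DRAFTS = 5
# The three gates described above as Gemini-judge; originality uses
# embeddings and schema uses Pydantic, so neither is counted here.
JUDGE_GATES = ("level_fit", "coherence", "bridge")

generation_calls = DRAFTS                   # assumed: one call per draft
judge_calls = DRAFTS * len(JUDGE_GATES)     # assumed: one call per draft per gate
total_calls = generation_calls + judge_calls
```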
2026-05-01 13:38:18 -04:00


{
  "generated_at": "2026-05-01T17:34:46+00:00",
  "originality_threshold": 0.92,
  "drafts_evaluated": 5,
  "passes": 4,
  "fails": 1,
  "errors": 0,
  "rows": [
    {
      "path": "interviews/vault/questions/edge/cross-cutting/edge-2537.yaml.draft",
      "draft_id": "edge-2537",
      "track": "edge",
      "topic": "tco-cost-modeling",
      "level": "L3",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "edge-1169",
        "cosine": 0.8187,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate question requires straightforward application of given values to calculate data transmission costs and savings over time, matching the quantitative application (L3) cognitive demand seen in the exemplars."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The calculations accurately compute monthly data usage based on a 30-day month and 1,000,000 KB per GB, resulting in $7,500 for Option A, $150 for Option B, and exactly $88,200 in annual savings."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate logically bridges the progression by calculating the quantitative difference between streaming raw data versus local processing (L3), connecting the introductory concept of cellular data costs (L1) to the advanced diagnostic scenario of fixing an architectural data overage (L4)."
      },
      "verdict": "pass"
    },
    {
      "path": "interviews/vault/questions/edge/latency/edge-2535.yaml.draft",
      "draft_id": "edge-2535",
      "track": "edge",
      "topic": "latency-decomposition",
      "level": "L3",
      "schema_ok": true,
      "originality": "fail",
      "originality_detail": {
        "top_neighbour": "edge-1883",
        "cosine": 0.9328,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "originality_reason": "too similar to edge-1883 (cosine=0.933 >= 0.92)",
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate requires applying computational formulas to calculate theoretical latency and using that result to deduce system bottlenecks, aligning perfectly with the L3 application of latency decomposition principles seen in the exemplars."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The solution correctly identifies that the theoretical latency is a fraction of a millisecond (0.15 ms) and logically attributes the 60ms measured latency to host-side bottlenecks, fully addressing the prompt."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate perfectly bridges the L2 identification of latency stages and the L4 optimization of host bottlenecks by having the learner calculate the theoretical TPU compute time to prove that host-side overhead dominates."
      },
      "verdict": "fail"
    },
    {
      "path": "interviews/vault/questions/edge/optimization/edge-2536.yaml.draft",
      "draft_id": "edge-2536",
      "track": "edge",
      "topic": "pruning-sparsity",
      "level": "L4",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "edge-1957",
        "cosine": 0.9046,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate perfectly mirrors the analytical depth of exemplar edge-0093 by requiring the candidate to analyze the mismatch between unstructured sparsity and dense matrix multiplication hardware (systolic arrays)."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The scenario, question, and solution are perfectly aligned, accurately addressing the hardware limitation of dense systolic arrays when encountering unstructured sparsity."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate smoothly bridges the L3 identification of structured pruning and the L5 strategic application by providing an L4 diagnostic analysis of why unstructured pruning fails on the Coral TPU's systolic architecture."
      },
      "verdict": "pass"
    },
    {
      "path": "interviews/vault/questions/mobile/deployment/mobile-2147.yaml.draft",
      "draft_id": "mobile-2147",
      "track": "mobile",
      "topic": "model-format-conversion",
      "level": "L2",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "mobile-1022",
        "cosine": 0.8858,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate question requires basic understanding of precision sizes (FP32 vs FP16) and a straightforward calculation to determine storage footprint, perfectly aligning with the L2 comprehension and calculation level demonstrated in the exemplars."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The calculation accurately determines the storage footprint based on parameter count and data type sizes, reducing 60 MB to 30 MB, perfectly addressing the scenario and question."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate introduces the mathematical calculation for FP16 parameter sizing in a PyTorch-to-CoreML context at L2, perfectly bridging the L1 format compatibility recall and the L3 pipeline execution that requires an unprompted sizing calculation."
      },
      "verdict": "pass"
    },
    {
      "path": "interviews/vault/questions/mobile/power/mobile-2146.yaml.draft",
      "draft_id": "mobile-2146",
      "track": "mobile",
      "topic": "duty-cycling",
      "level": "L3",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "mobile-0341",
        "cosine": 0.8463,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate aligns perfectly with the L3 exemplars by requiring the direct application of power-time-energy formulas to calculate total energy consumption across distinct operational phases."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The scenario clearly defines the power draw and duration for each phase, which the solution accurately uses to calculate the total energy per cycle and scales correctly for a 1-hour session."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate seamlessly bridges the L2 simple duty-cycle calculation and the L4 thermal analysis by adding the L3 complexity of transient wake-up overhead within the established dashcam scenario."
      },
      "verdict": "pass"
    }
  ]
}
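
A reviewer who wants to sanity-check a scorecard of this shape could use
something like the sketch below. It assumes a row's verdict is "pass" only
when `schema_ok` is true and all four gates pass, which matches the five
rows above but is not a documented rule; `check_scorecard` and the trimmed
two-row sample are hypothetical.

```python
import json

GATES = ("originality", "level_fit", "coherence", "bridge")


def check_scorecard(card):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    rows = card["rows"]

    # Top-level counters should agree with the rows.
    if card["drafts_evaluated"] != len(rows):
        problems.append("drafts_evaluated does not match number of rows")
    for key, verdict in (("passes", "pass"), ("fails", "fail")):
        counted = sum(1 for row in rows if row["verdict"] == verdict)
        if card[key] != counted:
            problems.append(f"{key}={card[key]} but rows show {counted}")

    for row in rows:
        # Assumed rule: verdict passes only if schema and every gate pass.
        ok = row["schema_ok"] and all(row[g] == "pass" for g in GATES)
        if row["verdict"] != ("pass" if ok else "fail"):
            problems.append(f"{row['draft_id']}: verdict disagrees with gates")

        # Originality fails at or above the threshold.
        detail = row["originality_detail"]
        expected = "fail" if detail["cosine"] >= detail["threshold"] else "pass"
        if row["originality"] != expected:
            problems.append(f"{row['draft_id']}: originality disagrees with cosine")
    return problems


# Two rows trimmed from the scorecard, with counters adjusted to match.
sample = json.loads("""
{"drafts_evaluated": 2, "passes": 1, "fails": 1, "errors": 0, "rows": [
  {"draft_id": "edge-2537", "schema_ok": true, "originality": "pass",
   "originality_detail": {"cosine": 0.8187, "threshold": 0.92},
   "level_fit": "pass", "coherence": "pass", "bridge": "pass", "verdict": "pass"},
  {"draft_id": "edge-2535", "schema_ok": true, "originality": "fail",
   "originality_detail": {"cosine": 0.9328, "threshold": 0.92},
   "level_fit": "pass", "coherence": "pass", "bridge": "pass", "verdict": "fail"}
]}
""")
problems = check_scorecard(sample)
```

In practice the dict would come from `json.load` on the scorecard file
rather than from an inline sample.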