mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-05-06 17:49:07 -05:00)
feat(vault): Phase 3 pilot — 5 gaps generated, 4 promoted as drafts
Pilot run of the Phase 3 authoring tooling on a 5-gap subset (sized
down from the roadmap's 30 to keep wall-time + Gemini-call budget
reasonable for an unsupervised run).
Pilot scope:
Selected 5 high-value gaps from gaps.proposed.lenient.json — buckets
with ≥4 published questions, biased toward low-density tracks. All 5
picks landed in edge/mobile.
Phase 3.c — generate (5/5 written):
edge-2535 edge/latency-decomposition L?→L3
edge-2536 edge/pruning-sparsity L?→L4
edge-2537 edge/tco-cost-modeling L?→L3
mobile-2146 mobile/duty-cycling L?→L3
mobile-2147 mobile/model-format-conversion L?→L2
Phase 3.b validation — 4/5 pass (80% — above roadmap's 60-75% target):
edge-2535: FAIL on originality (cos=0.933 vs edge-1883, threshold 0.92)
edge-2536: pass on all 4 gates
edge-2537: pass on all 4 gates
mobile-2146: pass on all 4 gates
mobile-2147: pass on all 4 gates
The originality gate correctly caught a draft that was too similar
to one of its bridge anchors — exactly the failure mode it was
designed for. The gates: schema (Pydantic), originality
(BAAI/bge-small-en-v1.5 cosine vs in-bucket neighbours, threshold
0.92), level_fit (Gemini-judge against same-level exemplars),
coherence (Gemini-judge), and bridge (Gemini-judge against the gap
anchors).
Phase 3.d — promotion (4 passing drafts):
- .yaml.draft → .yaml rename
- _authoring stripped; replaced with proper schema fields:
provenance: llm-draft
status: draft (NOT published — gating on human review)
authors: [gemini-3.1-pro-preview]
human_reviewed: { status: not-reviewed }
tags: + gap-bridge:<from>-<to>
- id-registry.yaml appended (append-only ledger preserved)
- edge-2535.yaml.draft kept in place for the human reviewer's
disposition (rewrite + retry vs delete)
Validation post-promotion:
- vault check --strict: 10,705 loaded (was 10,701; +4 ✓), 0 failures
- vault build --legacy-json: released set unchanged
(status=draft excluded by release-policy.yaml's published filter)
— releaseHash and chainCount intentionally stable until human
review flips status
Phase 3.e (chain rebuild) deferred: drafts must clear human review
and flip to status: published before they're eligible for chain
membership. Runbook in CHAIN_ROADMAP.md Progress Log.
Cost: 5 generation + 15 judge = 20 Gemini calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -3,7 +3,7 @@
**Status:** active workstream
**Branch:** `yaml-audit` (off `dev`)
**Worktree:** `/Users/VJ/GitHub/MLSysBook-yaml-audit`
-**Last updated:** 2026-05-01 (Phase 3.a + 3.b tooling shipped; 3.c pilot deferred for review)
+**Last updated:** 2026-05-01 (Phase 3.c pilot run + 3.d promotion shipped; 4 new draft questions in corpus, awaiting human review)

This document is the canonical resumable plan for the vault chain rebuild
+ corpus growth work. **Future Claude sessions: read the "Resume Here"
@@ -368,7 +368,7 @@ primary chains in default surfaces, exposes secondary in "more paths."

## Phase 3 — Gap-driven question authoring

-**Status:** `tooling complete (3.a + 3.b); pilot 3.c deferred for review`
+**Status:** `pilot run shipped (3.c + 3.d); 3.e gated on human review of drafts`
**Goal:** Use the 138+ entries in `gaps.proposed.json` to author new
questions filling missing rungs, validated independently before commit.
This is the durable corpus growth strategy.
@@ -1064,5 +1064,123 @@ with the user available to spot-check the first few outputs).

---

### 2026-05-01 — Phase 3.c + 3.d: pilot run + promotion (5 gaps)

**Pilot scope (sized down from the roadmap's 30):** 5 high-value gaps,
selected from `gaps.proposed.lenient.json`, favoring (track, topic)
buckets with ≥4 published questions and biased toward low-density
tracks. All 5 picks landed in edge/mobile (the buckets the lenient
sweep flagged as lowest-density).
**Phase 3.c — generate (`generate_question_for_gap.py`):**

| target | gap | result |
|---|---|---|
| edge-2535 | edge/latency-decomposition L?→L3 between=[edge-1883, edge-1701] | written |
| edge-2536 | edge/pruning-sparsity L?→L4 between=[edge-1960, edge-1957] | written |
| edge-2537 | edge/tco-cost-modeling L?→L3 between=[edge-0731, edge-1154] | written |
| mobile-2146 | mobile/duty-cycling L?→L3 between=[mobile-0367, mobile-2034] | written |
| mobile-2147 | mobile/model-format-conversion L?→L2 between=[mobile-0984, mobile-1022] | written |
5/5 generated cleanly. Each draft passed Pydantic schema validation
inline (the `assemble_draft` → `Question.model_validate` gate); none
were rejected at the file-write step.

Spot-checking `edge-2535`: a realistic ML-systems scenario (Coral USB
TPU + MobileNetV2-SSD + INT8), concrete numbers, a calculation-driven
question consistent with L3/apply, and a solution that gets at the actual
insight (host-side bottleneck). The other 4 are similarly competent.
**Phase 3.b run — `validate_drafts.py`:**

| draft | originality | level_fit | coherence | bridge | verdict |
|---|---|---|---|---|---|
| edge-2535 | **fail** (cos=0.933 vs edge-1883) | pass | pass | pass | **fail** |
| edge-2536 | pass | pass | pass | pass | **pass** |
| edge-2537 | pass | pass | pass | pass | **pass** |
| mobile-2146 | pass | pass | pass | pass | **pass** |
| mobile-2147 | pass | pass | pass | pass | **pass** |
**4/5 pass = 80% pass rate** (above the roadmap's 60-75% estimate).
The one failure was caught correctly — `edge-2535`'s draft scenario
turned out too similar to one of its between-questions
(`edge-1883`), cosine 0.933 over the 0.92 threshold. This is the
gate working as designed: Gemini occasionally drafts a "bridge" that's
just a paraphrase of one of its anchors instead of a true L3
intermediate. The gate filtered it.
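For reference, the shape of the originality gate is roughly this (a minimal dependency-free sketch; the real `validate_drafts.py` embeds with BAAI/bge-small-en-v1.5, and these function names are illustrative, not the script's actual API):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def originality_gate(draft_vec, neighbour_vecs, threshold=0.92):
    """Fail the draft if any in-bucket neighbour is too close.

    neighbour_vecs: {question_id: embedding} for the (track, topic) bucket.
    Returns (verdict, top_neighbour_id, top_cosine), like a scorecard row.
    """
    sims = {qid: cosine(draft_vec, vec) for qid, vec in neighbour_vecs.items()}
    top_id, top_sim = max(sims.items(), key=lambda kv: kv[1])
    verdict = "fail" if top_sim >= threshold else "pass"
    return verdict, top_id, round(top_sim, 4)
```

`edge-2535`'s fail corresponds to the top cosine landing at 0.9328 against `edge-1883`, just over the 0.92 line.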
**Phase 3.d — promotion (4 passing drafts):**
- `.yaml.draft` → `.yaml` rename for the 4 passes.
- `_authoring` private metadata stripped at promotion; replaced with:
  - `provenance: llm-draft`
  - `status: draft` (not `published` — gating on human review)
  - `authors: ["gemini-3.1-pro-preview"]`
  - `human_reviewed: { status: not-reviewed, ... }` so the
    not-yet-reviewed state is honest and machine-checkable.
  - `tags`: original tags preserved + a new `gap-bridge:<from>-<to>`
    tag so these can be queried later.
- IDs appended to `id-registry.yaml`: `edge-2536`, `edge-2537`,
  `mobile-2146`, `mobile-2147` — created_by `generate_question_for_gap.py`.
- `edge-2535.yaml.draft` was **kept in place** (still `.yaml.draft`).
  Decision for the human reviewer when they triage: rewrite + retry,
  or delete.
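The promotion step amounts to a small metadata rewrite per draft. A sketch of that transform (a hypothetical helper, not the actual promotion tooling; field values match the schema fields listed above):

```python
def promote_draft(doc: dict, gap_from: str, gap_to: str) -> dict:
    """Strip private _authoring metadata and add the promotion-time fields."""
    promoted = {k: v for k, v in doc.items() if k != "_authoring"}
    promoted["provenance"] = "llm-draft"
    promoted["status"] = "draft"  # NOT published; human review flips this later
    promoted["authors"] = ["gemini-3.1-pro-preview"]
    promoted["human_reviewed"] = {"status": "not-reviewed", "by": None, "date": None}
    # copy rather than mutate, so the original tag list is untouched
    promoted["tags"] = list(doc.get("tags", [])) + [f"gap-bridge:{gap_from}-{gap_to}"]
    return promoted
```

The `.yaml.draft` → `.yaml` rename and the registry append happen around this rewrite.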
**Validation post-promotion:**
- `vault check --strict` → 10,705 loaded (was 10,701; +4 ✓), 0 invariant
  failures.
- `vault build --legacy-json` → released set unchanged: 9438 published,
  chainCount=879, releaseHash=04ee8a23… (drafts have status=draft, so
  the publishing filter excludes them — by design).
**Phase 3.e — chain rebuild (deferred):**
Skipped tonight. The new questions are `status: draft` and the
chain-builder filters on published, so a rebuild wouldn't pick them
up. The right sequence is: human reviews the 4 drafts → flips status
to `published` (and `human_reviewed.status` to `verified`) → then
re-runs `build_chains_with_gemini.py --all`. At that point chainCount
is expected to grow modestly (the 4 new questions were authored *to*
fit chains, so they should land in their bridge slots).
**Files changed in the Phase 3 pilot commit:**
- `interviews/vault/questions/edge/cross-cutting/edge-2537.yaml` (new)
- `interviews/vault/questions/edge/optimization/edge-2536.yaml` (new)
- `interviews/vault/questions/mobile/deployment/mobile-2147.yaml` (new)
- `interviews/vault/questions/mobile/power/mobile-2146.yaml` (new)
- `interviews/vault/questions/edge/latency/edge-2535.yaml.draft` (new — failed validation, awaiting reviewer disposition)
- `interviews/vault/draft-validation-scorecard.json` (new — per-row record)
- `interviews/vault/id-registry.yaml` (4 appended entries)
- `interviews/vault-cli/docs/CHAIN_ROADMAP.md` (this entry)
**Notes for next session — review checklist:**
1. Read each of the 4 promoted drafts. Spot-checks suggest they're
   competent, but cognitive-load calibration is where Gemini
   drift is most likely. Each scorecard row has the `level_fit` rationale
   from the LLM judge — those are first-cut signals, not authoritative.
2. For the failed `edge-2535`: read it next to its high-cosine
   neighbour (`edge-1883`). If it's as duplicative as the originality
   gate suggests, delete it; if it's actually distinct enough, edit and
   re-validate (you can re-run `validate_drafts.py` after editing).
3. Once you're happy with N drafts, flip their `status: draft → published`
   and `human_reviewed.status → verified`, set `human_reviewed.by`, then:
   ```bash
   vault check --strict
   vault build --legacy-json  # released question count goes up by N
   python3 interviews/vault-cli/scripts/build_chains_with_gemini.py --all \
     --output interviews/vault/chains.proposed.json
   python3 interviews/vault-cli/scripts/apply_proposed_chains.py
   ```
4. If the pilot's 80% rate holds at scale, a 30-gap batch would land
   ~24 promotable drafts and absorb ~12-15 of them into chains
   (chain rebuild typically picks up ~50% of new questions, per the
   roadmap).
**Cost note:** This pilot used 5 generation calls + 5 × 3 judge calls = 20 Gemini
calls. A 30-gap batch would be ~120 calls (still under the 250/day cap, but
worth budgeting around).

**Next step:** Phase 3.e — chain rebuild. Gated on human review of the
4 drafts now in the tree.
---

<!-- Append new entries above this comment; reverse chronological is fine,
but keep entries dated and self-contained for resume context. -->
interviews/vault/draft-validation-scorecard.json (new file, 151 lines)
@@ -0,0 +1,151 @@
{
  "generated_at": "2026-05-01T17:34:46+00:00",
  "originality_threshold": 0.92,
  "drafts_evaluated": 5,
  "passes": 4,
  "fails": 1,
  "errors": 0,
  "rows": [
    {
      "path": "interviews/vault/questions/edge/cross-cutting/edge-2537.yaml.draft",
      "draft_id": "edge-2537",
      "track": "edge",
      "topic": "tco-cost-modeling",
      "level": "L3",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "edge-1169",
        "cosine": 0.8187,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate question requires straightforward application of given values to calculate data transmission costs and savings over time, matching the quantitative application (L3) cognitive demand seen in the exemplars."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The calculations accurately compute monthly data usage based on a 30-day month and 1,000,000 KB per GB, resulting in $7,500 for Option A, $150 for Option B, and exactly $88,200 in annual savings."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate logically bridges the progression by calculating the quantitative difference between streaming raw data versus local processing (L3), connecting the introductory concept of cellular data costs (L1) to the advanced diagnostic scenario of fixing an architectural data overage (L4)."
      },
      "verdict": "pass"
    },
    {
      "path": "interviews/vault/questions/edge/latency/edge-2535.yaml.draft",
      "draft_id": "edge-2535",
      "track": "edge",
      "topic": "latency-decomposition",
      "level": "L3",
      "schema_ok": true,
      "originality": "fail",
      "originality_detail": {
        "top_neighbour": "edge-1883",
        "cosine": 0.9328,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "originality_reason": "too similar to edge-1883 (cosine=0.933 >= 0.92)",
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate requires applying computational formulas to calculate theoretical latency and using that result to deduce system bottlenecks, aligning perfectly with the L3 application of latency decomposition principles seen in the exemplars."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The solution correctly identifies that the theoretical latency is a fraction of a millisecond (0.15 ms) and logically attributes the 60ms measured latency to host-side bottlenecks, fully addressing the prompt."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate perfectly bridges the L2 identification of latency stages and the L4 optimization of host bottlenecks by having the learner calculate the theoretical TPU compute time to prove that host-side overhead dominates."
      },
      "verdict": "fail"
    },
    {
      "path": "interviews/vault/questions/edge/optimization/edge-2536.yaml.draft",
      "draft_id": "edge-2536",
      "track": "edge",
      "topic": "pruning-sparsity",
      "level": "L4",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "edge-1957",
        "cosine": 0.9046,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate perfectly mirrors the analytical depth of exemplar edge-0093 by requiring the candidate to analyze the mismatch between unstructured sparsity and dense matrix multiplication hardware (systolic arrays)."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The scenario, question, and solution are perfectly aligned, accurately addressing the hardware limitation of dense systolic arrays when encountering unstructured sparsity."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate smoothly bridges the L3 identification of structured pruning and the L5 strategic application by providing an L4 diagnostic analysis of why unstructured pruning fails on the Coral TPU's systolic architecture."
      },
      "verdict": "pass"
    },
    {
      "path": "interviews/vault/questions/mobile/deployment/mobile-2147.yaml.draft",
      "draft_id": "mobile-2147",
      "track": "mobile",
      "topic": "model-format-conversion",
      "level": "L2",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "mobile-1022",
        "cosine": 0.8858,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate question requires basic understanding of precision sizes (FP32 vs FP16) and a straightforward calculation to determine storage footprint, perfectly aligning with the L2 comprehension and calculation level demonstrated in the exemplars."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The calculation accurately determines the storage footprint based on parameter count and data type sizes, reducing 60 MB to 30 MB, perfectly addressing the scenario and question."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate introduces the mathematical calculation for FP16 parameter sizing in a PyTorch-to-CoreML context at L2, perfectly bridging the L1 format compatibility recall and the L3 pipeline execution that requires an unprompted sizing calculation."
      },
      "verdict": "pass"
    },
    {
      "path": "interviews/vault/questions/mobile/power/mobile-2146.yaml.draft",
      "draft_id": "mobile-2146",
      "track": "mobile",
      "topic": "duty-cycling",
      "level": "L3",
      "schema_ok": true,
      "originality": "pass",
      "originality_detail": {
        "top_neighbour": "mobile-0341",
        "cosine": 0.8463,
        "threshold": 0.92,
        "bucket_size": 34
      },
      "level_fit": "pass",
      "level_fit_detail": {
        "rationale": "The candidate aligns perfectly with the L3 exemplars by requiring the direct application of power-time-energy formulas to calculate total energy consumption across distinct operational phases."
      },
      "coherence": "pass",
      "coherence_detail": {
        "rationale": "The scenario clearly defines the power draw and duration for each phase, which the solution accurately uses to calculate the total energy per cycle and scales correctly for a 1-hour session."
      },
      "bridge": "pass",
      "bridge_detail": {
        "rationale": "The candidate seamlessly bridges the L2 simple duty-cycle calculation and the L4 thermal analysis by adding the L3 complexity of transient wake-up overhead within the established dashcam scenario."
      },
      "verdict": "pass"
    }
  ]
}
@@ -15221,3 +15221,7 @@ entries:
 - {id: tinyml-1822, created_at: 2026-04-25T22:29:20+00:00, created_by: registry-rebuild-2026-04-25}
 - {id: tinyml-1823, created_at: 2026-04-25T22:29:20+00:00, created_by: registry-rebuild-2026-04-25}
 - {id: tinyml-1824, created_at: 2026-04-25T22:29:20+00:00, created_by: registry-rebuild-2026-04-25}
+- {id: edge-2537, created_at: 2026-05-01T17:35:39+00:00, created_by: generate_question_for_gap.py}
+- {id: edge-2536, created_at: 2026-05-01T17:35:39+00:00, created_by: generate_question_for_gap.py}
+- {id: mobile-2147, created_at: 2026-05-01T17:35:39+00:00, created_by: generate_question_for_gap.py}
+- {id: mobile-2146, created_at: 2026-05-01T17:35:39+00:00, created_by: generate_question_for_gap.py}
interviews/vault/questions/edge/cross-cutting/edge-2537.yaml (new file, 72 lines)
@@ -0,0 +1,72 @@
schema_version: '1.0'
id: edge-2537
track: edge
level: L3
zone: fluency
topic: tco-cost-modeling
competency_area: cross-cutting
bloom_level: apply
phase: inference
title: 'Edge TCO Fluency: Monthly Cellular Data Cost Calculation'
scenario: A fleet of 5,000 edge traffic monitors uses a cellular plan costing $5/GB. Each unit logs 100
  events per day. Option A transmits a 100KB image per event to the cloud for processing. Option B runs
  the model locally and transmits a 2KB JSON metadata payload.
question: What is the total monthly cellular data cost for the fleet under both options, and what are
  the annual operational savings of choosing Option B?
details:
  realistic_solution: Option A costs $7,500 per month, while Option B costs $150 per month. By choosing
    local processing and sending only metadata (Option B), the fleet saves $88,200 annually in connectivity
    operations costs.
  common_mistake: '**The Pitfall:** Overlooking the scale multiplier when calculating edge OpEx.

    **The Rationale:** Candidates might correctly calculate the data cost for a single device or a single
    day but fail to multiply by the 5,000-unit fleet size and 30-day month for the monthly total, or the
    12 months for the annual savings.

    **The Consequence:** Proposing architectures that appear cheap on a per-unit basis but become prohibitively
    expensive at scale, blowing out the OpEx budget.'
  napkin_math: '**Assumptions & Constraints:**

    Assume standard base-10 networking prefixes (1 GB = 1,000,000 KB). Assume 1 month = 30 days.


    **Calculations:**

    Events per month per unit: 100 events/day × 30 days = 3,000 events/month.

    Total fleet events per month: 3,000 × 5,000 = 15,000,000 events.

    Option A data: 15,000,000 × 100KB = 1,500,000,000 KB = 1,500 GB.

    Option A cost: 1,500 GB × $5/GB = $7,500/month.

    Option B data: 15,000,000 × 2KB = 30,000,000 KB = 30 GB.

    Option B cost: 30 GB × $5/GB = $150/month.

    Monthly savings: $7,500 - $150 = $7,350.

    Annual savings: $7,350 × 12 = $88,200.


    **Conclusion:**

    Option B provides an $88,200 annual savings, illustrating how edge compute dramatically reduces connectivity
    OpEx at scale.'
status: draft
provenance: llm-draft
requires_explanation: false
expected_time_minutes: 10
tags:
- tco
- cellular
- opex
- bandwidth
- gap-bridge:edge-0731-edge-1154
authors:
- gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
created_at: '2026-05-01T17:27:13+00:00'
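The napkin_math in edge-2537 is easy to spot-check mechanically (a quick sketch in plain Python, mirroring the file's stated assumptions):

```python
fleet_units = 5_000
events_per_day = 100
days_per_month = 30
kb_per_gb = 1_000_000  # base-10 prefix, per the stated assumption
usd_per_gb = 5

events_per_month = fleet_units * events_per_day * days_per_month  # 15,000,000

def monthly_cost(kb_per_event: float) -> float:
    """Monthly cellular cost in USD for the whole fleet."""
    return events_per_month * kb_per_event / kb_per_gb * usd_per_gb

option_a = monthly_cost(100)  # cloud: 100KB image per event -> $7,500/month
option_b = monthly_cost(2)    # edge: 2KB JSON per event     -> $150/month
annual_savings = (option_a - option_b) * 12  # $88,200
```

All three figures match the draft's solution, which is consistent with the coherence gate's rationale.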
interviews/vault/questions/edge/latency/edge-2535.yaml.draft (new file, 61 lines)
@@ -0,0 +1,61 @@
schema_version: '1.0'
id: edge-2535
track: edge
level: L3
zone: diagnosis
topic: latency-decomposition
competency_area: latency
bloom_level: apply
phase: inference
title: Theoretical vs. Measured Latency on the Coral USB TPU
scenario: A team deploys an INT8 quantized MobileNetV2-SSD model requiring 0.6 GOPS to a Coral Edge TPU
  USB Accelerator, which is rated at 4 TOPS. The team measures an end-to-end frame processing latency
  of 60ms and concludes the TPU compute is too slow for their workload.
question: Calculate the theoretical minimum inference latency for this model on the Coral TPU. Based on
  this calculation, what is the most likely cause of the 60ms measured latency?
details:
  realistic_solution: The theoretical compute latency is a fraction of a millisecond. The massive discrepancy
    between the sub-millisecond compute time and the 60ms measured latency indicates that the bottleneck
    is not the TPU hardware, but rather host-side operations such as image resizing, color space conversion,
    USB transfer overhead, or CPU-bound post-processing.
  common_mistake: '**The Pitfall:** Assuming the TPU compute is the bottleneck because the overall system
    is slow, or miscalculating the unit conversion between GOPS and TOPS.

    **The Rationale:** Engineers often conflate raw ''inference time'' with ''end-to-end latency'', forgetting
    that a USB accelerator requires significant host CPU coordination, data marshaling, and PCI/USB bus
    transfer times.

    **The Consequence:** The team might waste weeks pruning or re-training the neural network architecture
    when the actual fix involves optimizing host CPU preprocessing pipelines or leveraging hardware video
    decoders.'
  napkin_math: '**Assumptions & Constraints:** The model requires 0.6 Giga Operations (GOPS). The Coral
    TPU provides 4 Tera Operations Per Second (TOPS), which is 4,000 GOPS.


    **Calculations:** Theoretical latency = Model Operations / Hardware Throughput = 0.6 GOPS / 4000 GOPS
    = 0.00015 seconds, or 0.15 milliseconds. Even factoring in a highly conservative 20% hardware utilization
    rate due to memory bandwidth constraints, compute time is ~0.75ms.


    **Conclusion:** The TPU computation accounts for roughly 1% of the 60ms end-to-end latency. The remaining
    ~59ms is consumed by host-to-device I/O and CPU-bound pre/post-processing tasks.'
status: draft
provenance: llm-draft
requires_explanation: false
expected_time_minutes: 10
tags:
- coral-tpu
- bottlenecks
- latency-decomposition
- napkin-math
_authoring:
  origin: gemini-3.1-pro-preview
  tool: generate_question_for_gap.py
  generated_at: '2026-05-01T17:25:51+00:00'
  gap:
    between:
    - edge-1883
    - edge-1701
    missing_level: L3
    rationale: Calculating expected inference latency versus actual measured pipeline latency on the Coral
      TPU.
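The core calculation in edge-2535 (failed on originality, not on arithmetic) also checks out — a sketch with the scenario's values:

```python
model_ops = 0.6e9         # 0.6 giga-operations per inference
tpu_ops_per_s = 4e12      # 4 TOPS
measured_latency_s = 60e-3

# ideal compute time: 1.5e-4 s = 0.15 ms
theoretical_s = model_ops / tpu_ops_per_s
# conservative 20% utilization case from the napkin_math: ~0.75 ms
with_20pct_util_s = theoretical_s / 0.20
# even the conservative figure is ~1% of the 60ms end-to-end latency
compute_share = with_20pct_util_s / measured_latency_s
```

That ~1% share is the quantitative basis for the draft's host-side-bottleneck conclusion.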
interviews/vault/questions/edge/optimization/edge-2536.yaml (new file, 58 lines)
@@ -0,0 +1,58 @@
schema_version: '1.0'
id: edge-2536
track: edge
level: L4
zone: diagnosis
topic: pruning-sparsity
competency_area: optimization
bloom_level: analyze
phase: inference
title: Diagnosing Zero Latency Gains from Unstructured Pruning on Coral TPU
scenario: A team deployed a MobileNetV2 model onto a Google Coral Edge TPU. To reduce the baseline 80ms
  inference latency, they applied magnitude-based unstructured pruning, achieving 75% sparsity while maintaining
  acceptable accuracy. After compiling with the Edge TPU Compiler and deploying the quantized INT8 model,
  the inference latency remains stubbornly stuck at 80ms.
question: Why did the 75% unstructured sparsity fail to yield any latency improvements on the Coral Edge
  TPU, and what architectural characteristic of the accelerator dictates this outcome?
details:
  realistic_solution: The Coral Edge TPU relies on a dense systolic array architecture optimized for bulk,
    contiguous INT8 matrix multiplications. Unstructured pruning zeroes out scattered weights but does
    not change the physical dimensions of the tensors. Because the TPU lacks specialized hardware to skip
    unstructured zeros, it executes the full dense MAC grid—simply multiplying by zero—which consumes
    the exact same number of clock cycles.
  common_mistake: '**The Pitfall:** Assuming that a reduction in theoretical FLOPs automatically translates
    to lower inference latency on any hardware.

    **The Rationale:** Developers often rely on software-level metrics like model size or parameter count,
    neglecting how the specific hardware accelerator schedules and executes matrix operations.

    **The Consequence:** The engineering team wastes valuable time retraining and fine-tuning an unstructured
    sparse model that provides zero runtime benefit on the deployment edge device.'
  napkin_math: '**Assumptions & Constraints:** A single convolutional layer takes 10ms to execute dense.
    It is pruned to 75% unstructured sparsity. The Coral TPU executes MACs in fixed dense blocks.


    **Calculations:** The compiler cannot shrink the tensor dimensions because the zeros are randomly
    distributed. The TPU executes 100% of the MACs, where 75% of them happen to be operations with zero.
    Execution time = 10ms * 1.0 (dense execution schedule) = 10ms.


    **Conclusion:** Without structured pruning to physically reduce the number of channels or filters,
    the tensor shapes remain identical, and the systolic array offers exactly 0% speedup.'
status: draft
provenance: llm-draft
requires_explanation: false
expected_time_minutes: 10
tags:
- coral-tpu
- unstructured-pruning
- latency-bottleneck
- systolic-array
- gap-bridge:edge-1960-edge-1957
authors:
- gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
created_at: '2026-05-01T17:26:16+00:00'
interviews/vault/questions/mobile/deployment/mobile-2147.yaml (new file, 54 lines)
@@ -0,0 +1,54 @@
schema_version: '1.0'
id: mobile-2147
track: mobile
level: L2
zone: implement
topic: model-format-conversion
competency_area: deployment
bloom_level: understand
phase: inference
title: 'Model Format Conversion: Sizing the FP16 CoreML Payload'
scenario: Your team is preparing to convert a 15 million parameter computer vision model from PyTorch
  to CoreML for an iOS app. The original model was trained and saved in standard FP32 precision. To comply
  with strict App Store bundle limits, the conversion pipeline is configured to cast weights to FP16.
question: How does the FP16 conversion mathematically impact the model's storage footprint, and what is
  the expected payload size of the resulting CoreML model?
details:
  realistic_solution: Converting the weights from FP32 (4 bytes per parameter) to FP16 (2 bytes per parameter)
    halves the required storage. For a 15 million parameter model, this reduces the disk footprint from
    roughly 60 MB to an expected CoreML payload of 30 MB.
  common_mistake: '**The Pitfall:** Assuming the parameter count directly translates to megabytes (e.g.,
    15 million parameters = 15 MB) or forgetting to account for the byte-size of the data type.

    **The Rationale:** Engineers without a systems background often conflate the logical number of parameters
    with their physical byte representation on disk.

    **The Consequence:** This error in storage calculation leads to severely underestimating the final
    app bundle size and potential rejection from the App Store.'
  napkin_math: '**Assumptions & Constraints:** The model has 15,000,000 parameters. FP32 requires 4 bytes
    per parameter. FP16 requires 2 bytes per parameter. CoreML metadata overhead is negligible.


    **Calculations:** Original FP32 size = 15,000,000 * 4 bytes = 60,000,000 bytes (60 MB). Converted
    FP16 size = 15,000,000 * 2 bytes = 30,000,000 bytes (30 MB).


    **Conclusion:** The conversion process yields a 50% reduction in disk footprint, resulting in a 30
    MB file.'
status: draft
provenance: llm-draft
requires_explanation: false
expected_time_minutes: 5
tags:
- coreml
- pytorch
- precision
- ios
- gap-bridge:mobile-0984-mobile-1022
authors:
- gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
created_at: '2026-05-01T17:28:18+00:00'
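mobile-2147's sizing arithmetic is a two-liner to verify (bytes-per-parameter is the whole trick the question tests):

```python
params = 15_000_000
bytes_fp32, bytes_fp16 = 4, 2  # bytes per parameter for each precision

fp32_mb = params * bytes_fp32 / 1_000_000  # 60.0 MB on disk
fp16_mb = params * bytes_fp16 / 1_000_000  # 30.0 MB after the FP16 cast
reduction = 1 - fp16_mb / fp32_mb          # 0.5, i.e. the 50% footprint cut
```

Matches the draft's 60 MB → 30 MB claim under its base-10 MB assumption.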
interviews/vault/questions/mobile/power/mobile-2146.yaml (new file, 67 lines)
@@ -0,0 +1,67 @@
schema_version: '1.0'
id: mobile-2146
track: mobile
level: L3
zone: realization
topic: duty-cycling
competency_area: power
bloom_level: apply
phase: inference
title: 'The Hidden Cost of Waking Up: Dashcam Duty Cycling'
scenario: 'You are optimizing a smartphone dashcam app that duty-cycles the NPU. Every 10 seconds, the
  device executes a cycle: it wakes up the NPU (taking 0.5 seconds at a peak 4W power draw), runs inference
  for 2 seconds at 3W, and then idles the SoC for the remaining 7.5 seconds at 0.5W.'
question: Calculate the total energy consumed by the dashcam feature over a 1-hour driving session, explicitly
  factoring in the transient wake-up overhead.
details:
  realistic_solution: 'First, calculate the energy per 10-second cycle by summing the energy of each phase:
    Wake-up (0.5s * 4W = 2J), Active (2s * 3W = 6J), and Idle (7.5s * 0.5W = 3.75J), totaling 11.75 Joules.
    Then, multiply this by the 360 cycles in a 1-hour period (3600 seconds) to determine the total consumption
    of 4,230 Joules.'
  common_mistake: '**The Pitfall:** Ignoring the wake-up transition time and power overhead when calculating
    the duty cycle energy.

    **The Rationale:** Developers often assume duty cycling only involves binary active and idle states,
    overlooking the transient hardware costs like powering up the NPU and loading initial weights into
    SRAM.

    **The Consequence:** The theoretical energy budget is severely underestimated (in this case by ~17%),
    leading to unexpected battery drain and missed power targets in production.'
  napkin_math: '**Assumptions & Constraints:** 1 hour = 3600 seconds. A 10-second cycle occurs 360 times
    per hour. Energy (Joules) = Power (Watts) * Time (Seconds).


    **Calculations:**

    - Energy_wakeup = 4W * 0.5s = 2J

    - Energy_active = 3W * 2s = 6J

    - Energy_idle = 0.5W * 7.5s = 3.75J

    - Energy_cycle = 2J + 6J + 3.75J = 11.75J

    - Total_Energy = 11.75 J/cycle * 360 cycles = 4230 J


    **Conclusion:** The total energy consumed over 1 hour is 4,230 Joules. Notably, the wake-up overhead
    accounts for a significant portion of the total energy despite occupying only 5% of the physical time,
    demonstrating the limit of rapid duty-cycling.'
status: draft
provenance: llm-draft
requires_explanation: false
expected_time_minutes: 10
tags:
- duty-cycling
- power-optimization
- mobile
- npu
- energy-profiling
- gap-bridge:mobile-0367-mobile-2034
authors:
- gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
created_at: '2026-05-01T17:27:47+00:00'
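And mobile-2146's duty-cycle energy, checked phase by phase with the scenario's numbers (a sketch, not project tooling):

```python
# (power_watts, duration_seconds) for each phase of one 10-second cycle
phases = {
    "wakeup": (4.0, 0.5),   # transient NPU wake-up
    "active": (3.0, 2.0),   # inference
    "idle":   (0.5, 7.5),   # SoC idle
}

energy_per_cycle_j = sum(p * t for p, t in phases.values())  # 11.75 J
cycles_per_hour = 3600 // 10                                 # 360 cycles
total_j = energy_per_cycle_j * cycles_per_hour               # 4230 J

# wake-up share of the energy budget: the draft's "~17%" claim
wakeup_share = phases["wakeup"][0] * phases["wakeup"][1] / energy_per_cycle_j
```

Both the 4,230 J total and the ~17% wake-up share in the draft's common_mistake text check out.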