mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-07 18:18:42 -05:00
Final brute-force release-readiness pass: every gate green, 0.1.3
released and verified, every observable failure mode closed at source.
═══ AUDITS (G.A–G.D) ═══
G.A — gemini-3.1-pro-preview default everywhere. Active CLI scripts
already used it; bulk-patched 6 legacy scripts (`generate_batch.py`,
`validate_questions.py`, `generate_gaps.py`, `run_reviews.sh`,
`generate.py`, `review_math.sh`) + WORKFLOW.md off `gemini-2.5-flash`
or `gemini-2.5-pro` to `gemini-3.1-pro-preview`. Only `archive/`
references remain (intentionally legacy).
G.B — Cloudflare workflow audit. `vault verify 0.1.1` correctly
failed (YAMLs evolved since 0.1.1 cut). Confirmed `vault publish`,
`vault deploy`, `vault ship`, `vault rollback`, `vault verify`,
`vault snapshot`, `vault tag` all wired. Released 0.1.2 then 0.1.3
to lock final state.
G.C — Visual asset integrity audit. 236/236 YAML visual references
resolve, 0 orphan SVGs, 0 missing files, 0 unrendered sources.
Clean.
G.D — Unit tests for new validators added at `tests/test_models.py`:
15 tests covering Visual.kind enum, Visual.path regex, Visual.alt
+ caption min lengths + required, Question._zone_bloom_compatible
(recall+remember accepted, recall+evaluate rejected, mastery+
remember rejected, evaluation+evaluate accepted, design+create
accepted), Question._visual_path_resolves. **15/15 pass.**
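As a rough illustration of the zone/Bloom check these tests exercise, a validator of this kind might look like the sketch below. Only the five (zone, bloom) pairs listed above come from this commit. The table layout, the function name `zone_bloom_compatible`, and the rejection of every unlisted pair are assumptions for illustration, not the actual `Question._zone_bloom_compatible` implementation.

```python
# Hypothetical sketch: only the three accepted pairs named in the commit
# message are listed. The real validator very likely accepts more pairs.
ACCEPTED_PAIRS = {
    ("recall", "remember"),
    ("evaluation", "evaluate"),
    ("design", "create"),
}


def zone_bloom_compatible(zone: str, bloom: str) -> bool:
    """Return True only for (zone, bloom) pairs known to be accepted.

    Assumption: any pair missing from ACCEPTED_PAIRS is rejected.
    """
    return (zone, bloom) in ACCEPTED_PAIRS
```

With this table, `recall+evaluate` and `mastery+remember` are rejected simply because they are not listed.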
═══ CONTENT CLEANUP (G.E–G.L) ═══
G.E — Sample re-judge of 100 random cloud parallelism items via
Gemini 3.1 Pro Preview (4 API calls): 53% PASS / 23% NEEDS_FIX /
24% DROP. Surfaced legacy quality drift — items generated under
pre-Phase-D laxer prompts were not meeting the new strict bar
(math errors with bidirectional vs unidirectional NVLink,
"Based on the diagram..." references with no diagram, deprecated
practices like SSP for modern LLM training, wrong-track scenarios
like Cortex-M4 in cloud track).
G.H — General-purpose cleanup agent on 47 flagged items:
**31 rewritten** with PARALLELISM_RULES bar applied (concrete
unidirectional NVLink 450 GB/s, IB NDR 25 GB/s, RoCE v2 22 GB/s,
PCIe Gen3 12 GB/s; multi-step ring AllReduce arguments with the
2(N-1)/N factor; non-obvious failure modes); **16 archived** with
documented `deletion_reason` (mathematically broken premises,
physics errors, topic-irreconcilable, direct duplicates).
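For reference, the 2(N-1)/N factor can be turned into a quick bandwidth-only estimate. The sketch below uses the unidirectional link figures quoted above and ignores per-step latency and compute overlap. The function name `ring_allreduce_seconds` and the dictionary keys are illustrative, not names from the repository.

```python
# Unidirectional per-link bandwidths in GB/s, as quoted in the commit message.
LINK_GB_PER_S = {"nvlink": 450, "ib_ndr": 25, "roce_v2": 22, "pcie_gen3": 12}


def ring_allreduce_seconds(message_bytes: float, num_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only ring AllReduce time: 2(N-1)/N * S / B.

    Each GPU moves (N-1)/N of the buffer in the reduce-scatter phase and
    the same amount again in the all-gather phase. Latency is ignored.
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    factor = 2 * (num_gpus - 1) / num_gpus
    return factor * message_bytes / (link_gb_per_s * 1e9)


# Example: a 1 GB gradient buffer across 8 GPUs on each interconnect.
for name, bandwidth in LINK_GB_PER_S.items():
    seconds = ring_allreduce_seconds(1e9, 8, bandwidth)
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

For 8 GPUs the factor is 1.75, so the same 1 GB buffer takes 18 times longer over IB NDR than over NVLink under this model, matching the 450/25 bandwidth ratio.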
G.L — Re-judge of 31 G.H rewrites: **23 PASS / 3 NEEDS_FIX / 5 DROP =
74.2% pass rate**. The 8 still-failing items were archived: even after
the cleanup pass they could not satisfy the strict bar. Contract:
items get THREE chances (original generation, fix-agent, retry-fix);
if they still fail, they are archived, not promoted. Honest.
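The three-chances contract can be sketched as a small loop. This is only an illustration: the `judge` callable, the stage labels, and the returned status strings are hypothetical stand-ins, not the pipeline's real interfaces.

```python
from typing import Callable, List, Tuple

# Stage labels follow the contract described above.
STAGES = ["original", "fix-agent", "retry-fix"]


def resolve_item(item_id: str, judge: Callable[[str, str], str]) -> Tuple[str, List[str]]:
    """Give an item up to three judged attempts, then archive it.

    `judge(item_id, stage)` is a hypothetical callable that returns
    "PASS", "NEEDS_FIX", or "DROP" for the item at that stage.
    """
    history = []
    for stage in STAGES:
        verdict = judge(item_id, stage)
        history.append(f"{stage}:{verdict}")
        if verdict == "PASS":
            return "published", history
    # Three failed attempts: archive and keep the history for audit.
    return "archived", history
```

The history list is what a `deletion_reason` could be built from when an item ends up archived.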
═══ STUBBORN-FAIL ARCHIVES (Phase F residuals) ═══
After three independent fix-agent passes (Phase C, F.2, F.4), 4 items
remained NEEDS_FIX or DROP: edge-2390, edge-2401, mobile-1948,
tinyml-1681. Archived with `deletion_reason` documenting the 3-attempt
failure history. Their cells may be structurally awkward; the items
are preserved for audit but removed from the bundle.
═══ ORPHAN CHAIN FIX ═══
After archives, `cloud-chain-359` had only 1 published member
(`cloud-1840`); its sibling `cloud-1845` got archived. Dropped the
chain ref from cloud-1840 + ran `repair_chains.py` to clean residual
references in archived YAMLs. `vault check --strict` now passes with 0
chain warnings.
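The orphan condition above (a chain left with a single published member) is easy to check mechanically. The sketch below assumes each item is a dict with hypothetical `id`, `chain`, and `status` keys. The real YAML schema and `repair_chains.py` may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_orphan_chains(items: Iterable[dict]) -> Dict[str, List[str]]:
    """Return chains that have exactly one published member.

    Assumed item shape (hypothetical): {"id": ..., "chain": ..., "status": ...}.
    """
    published = defaultdict(list)
    for item in items:
        if item.get("chain") and item.get("status") == "published":
            published[item["chain"]].append(item["id"])
    # A chain needs at least two published members to be meaningful.
    return {chain: ids for chain, ids in published.items() if len(ids) < 2}


items = [
    {"id": "cloud-1840", "chain": "cloud-chain-359", "status": "published"},
    {"id": "cloud-1845", "chain": "cloud-chain-359", "status": "archived"},
]
orphans = find_orphan_chains(items)
```

Here `orphans` maps `cloud-chain-359` to its only published member, `cloud-1840`, which is the reference that was dropped.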
═══ E.2 / E.3 SHIPPED EARLIER IN PRIOR COMMIT ═══
(Documented in commit `20ea20005` for completeness):
- `vault build --legacy-json` auto-emits `vault-manifest.json`.
- `analyze_coverage_gaps.py --include-areas <areas>` flag.
═══ 0.1.3 FINAL RELEASE ═══
`vault publish 0.1.3` snapshot at `releases/0.1.3/`. Migrations:
+0 ~27 -28 (zero net new questions, 27 modified during cleanup, 28
archived/promoted). `vault verify 0.1.3` ✓ — release_hash
`793c06f414f2bf8391a8a5c56ec0ff8d76bfce4ab7c64ad12ecb83f6d932280e`
reconstructs from YAML. Latest symlink → 0.1.3.
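To illustrate what "reconstructs from YAML" can mean, the sketch below computes a deterministic SHA-256 digest over a tree of YAML files. This is an assumption-laden example: the actual `vault verify` algorithm is not shown in this commit, and the function name `release_hash`, the `*.yaml` glob, and the path-plus-content framing are all hypothetical.

```python
import hashlib
from pathlib import Path


def release_hash(root: Path) -> str:
    """Hash every *.yaml file under `root` in sorted path order.

    Hypothetical scheme: relative path and file bytes are both fed to
    SHA-256, separated by NUL bytes, so renames and edits change the hash.
    """
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*.yaml")):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Because files are visited in sorted order, the same YAML tree yields the same 64-character hex digest regardless of filesystem listing order, and any later edit to a YAML file makes verification against the stored hash fail.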
═══ FINAL ALL-9-GATES SWEEP — ALL GREEN ═══
[1] vault check --strict ✓ 10,701 / 0 errors / 0 invariants
[2] vault lint ✓ 0 errors / 0 warnings / 9,757 info
[3] vault doctor ✓ 0 fails (registry-history info OK)
[4] vault codegen --check ✓ artifacts in sync
[5] vault verify 0.1.3 ✓ hash reconstructs from YAML
[6] staffml validate-vault ✓ 0 errors / 0 warnings, deployment-ready
[7] render_visuals ✓ 236 visuals, 0 errors
[8] tsc ✓ TypeScript clean
[9] Playwright ✓ 9/9 pass
═══ FINAL CORPUS STATE ═══
Bundle: 9,757 published (was 9,224 at branch cut, **+533 net** across
the full multi-session push, after all archives).
Total commits on branch since cut: 10.
Release tag latest: 0.1.3 (verified-clean).
Status: StaffML-day-ready. Ship it.
643 lines
23 KiB
Python
#!/usr/bin/env python3
"""
Gap-fill generator for underfilled StaffML corpus cells.

Reads corpus.json, identifies all (track, level, competency_area) cells with
fewer than 3 questions, then generates the missing questions using either:
  - Gemini via the `gemini` CLI (default; model set by --gemini-model)
  - Claude Opus 4.6 via the Anthropic API (--model opus; requires ANTHROPIC_API_KEY)

Outputs are:
  1. Appended to the correct source markdown file
  2. Written as JSON to _generated_gaps.json for corpus rebuild

Usage:
    python3 generate_gaps.py                    # Gemini CLI (default)
    python3 generate_gaps.py --model opus       # Claude Opus 4.6
    python3 generate_gaps.py --dry-run          # Show plan, don't generate
    python3 generate_gaps.py --workers 4        # Control parallelism
"""

from __future__ import annotations

import argparse
import json
import os
import re
import subprocess
import sys
import time
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

BASE_DIR = Path(__file__).parent
CORPUS_PATH = BASE_DIR / "corpus.json"
NUMBERS_PATH = BASE_DIR / "NUMBERS.md"
OUTPUT_JSON = BASE_DIR / "_generated_gaps.json"
TARGET_PER_CELL = 3

# Badge metadata for markdown rendering
LEVEL_META = {
    "L1": {"label": "L1_Foundation", "color": "brightgreen", "alt": "Level 1"},
    "L2": {"label": "L2_Analytical", "color": "blue", "alt": "Level 2"},
    "L3": {"label": "L3_Junior", "color": "brightgreen", "alt": "Level 1"},
    "L4": {"label": "L4_Mid", "color": "blue", "alt": "Level 2"},
    "L5": {"label": "L5_Senior", "color": "yellow", "alt": "Level 3"},
    "L6+": {"label": "L6%2B_Staff", "color": "red", "alt": "Level 4"},
}

# Map competency area to representative topic tags for prompting
AREA_TO_TAGS = {
    "compute": ["roofline", "arithmetic-intensity", "compute-bound", "memory-bound"],
    "memory": ["memory-hierarchy", "kv-cache", "activation-memory", "memory-bandwidth"],
    "precision": ["quantization", "mixed-precision", "calibration"],
    "architecture": ["scaling-laws", "attention", "transformers", "depthwise-separable", "early-exit"],
    "latency": ["latency", "throughput", "ttft", "batching", "real-time"],
    "power": ["power", "thermal", "tops-w", "duty-cycle", "battery", "cooling"],
    "optimization": ["pruning", "distillation", "operator-fusion", "flash-attention", "compilation"],
    "parallelism": ["data-parallelism", "tensor-parallelism", "pipeline-parallelism", "fsdp"],
    "networking": ["interconnect", "network-topology", "congestion", "bus-protocol", "wireless"],
    "deployment": ["serving", "deployment", "rollout", "ota", "firmware"],
    "reliability": ["monitoring", "drift", "fault-tolerance", "watchdog", "checkpoint"],
    "data": ["data-pipeline", "data-quality", "training-serving-skew", "sensor-pipeline", "streaming-data"],
    "cross-cutting": ["economics", "tco", "cost-per-query", "security", "privacy"],
}

# Map track to preferred file for each competency area
# Picks the most thematically appropriate file for new questions
TRACK_FILE_MAP = {
    "cloud": {
        "compute": "01_single_machine.md",
        "memory": "01_single_machine.md",
        "precision": "01_single_machine.md",
        "architecture": "01_single_machine.md",
        "latency": "03_serving_stack.md",
        "power": "04_production_ops.md",
        "optimization": "01_single_machine.md",
        "parallelism": "02_distributed_systems.md",
        "networking": "02_distributed_systems.md",
        "deployment": "03_serving_stack.md",
        "reliability": "04_production_ops.md",
        "data": "01_single_machine.md",
        "cross-cutting": "04_production_ops.md",
    },
    "edge": {
        "compute": "01_hardware_platform.md",
        "memory": "01_hardware_platform.md",
        "precision": "01_hardware_platform.md",
        "architecture": "01_hardware_platform.md",
        "latency": "02_realtime_pipeline.md",
        "power": "01_hardware_platform.md",
        "optimization": "02_realtime_pipeline.md",
        "parallelism": "01_hardware_platform.md",
        "networking": "03_deployed_system.md",
        "deployment": "03_deployed_system.md",
        "reliability": "03_deployed_system.md",
        "data": "02_realtime_pipeline.md",
        "cross-cutting": "03_deployed_system.md",
    },
    "mobile": {
        "compute": "01_device_hardware.md",
        "memory": "01_device_hardware.md",
        "precision": "01_device_hardware.md",
        "architecture": "01_device_hardware.md",
        "latency": "02_app_experience.md",
        "power": "01_device_hardware.md",
        "optimization": "01_device_hardware.md",
        "parallelism": "01_device_hardware.md",
        "networking": "01_device_hardware.md",
        "deployment": "03_ship_and_update.md",
        "reliability": "03_ship_and_update.md",
        "data": "02_app_experience.md",
        "cross-cutting": "03_ship_and_update.md",
    },
    "tinyml": {
        "compute": "01_microcontroller.md",
        "memory": "01_microcontroller.md",
        "precision": "01_microcontroller.md",
        "architecture": "01_microcontroller.md",
        "latency": "02_sensing_pipeline.md",
        "power": "01_microcontroller.md",
        "optimization": "01_microcontroller.md",
        "parallelism": "01_microcontroller.md",
        "networking": "03_deployed_device.md",
        "deployment": "03_deployed_device.md",
        "reliability": "03_deployed_device.md",
        "data": "02_sensing_pipeline.md",
        "cross-cutting": "03_deployed_device.md",
    },
}

# Track-specific hardware context for prompt tuning
TRACK_CONTEXT = {
    "cloud": (
        "Cloud track: NVIDIA H100/A100 GPUs, HBM3 memory, NVLink/InfiniBand "
        "interconnects, data center power/cooling, multi-GPU servers. "
        "Focus on large model training and high-throughput serving."
    ),
    "edge": (
        "Edge track: NVIDIA Jetson Orin, Qualcomm RB5, Google Coral TPU, "
        "multi-core ARM CPUs with GPU/NPU accelerators, 10-30W power envelopes. "
        "Focus on real-time inference for robotics, autonomous vehicles, drones."
    ),
    "mobile": (
        "Mobile track: Apple A-series/M-series, Qualcomm Snapdragon with Hexagon NPU, "
        "Samsung Exynos, MediaTek Dimensity. 3-5W thermal envelope, battery life critical. "
        "Focus on on-device ML for apps: camera, NLP, recommender systems."
    ),
    "tinyml": (
        "TinyML track: ARM Cortex-M0 to M7 MCUs, 256KB-2MB SRAM, 1-16MB Flash, "
        "no OS, bare-metal C. Power budget: microwatts to milliwatts. "
        "Focus on keyword spotting, anomaly detection, sensor fusion on MCUs."
    ),
}

# Bloom's level cognitive descriptors
BLOOM_DESCRIPTORS = {
    "L1": "Remember — pure recall of facts, specs, ratios. Direct questions.",
    "L2": "Understand — single-variable calculations, explain why, compare two things.",
    "L3": "Apply — use a formula in a new situation, diagnose a described scenario.",
    "L4": "Analyze — multi-step debugging, identify root cause from symptoms.",
    "L5": "Evaluate — compare two architectures, justify a design choice with trade-offs.",
    "L6+": "Create — design a system from scratch, propose a novel solution to a constraint.",
}


# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------

@dataclass
class GapCell:
    """A single underfilled cell in the corpus matrix."""
    track: str
    level: str
    area: str
    current_count: int
    needed: int  # questions to generate

    @property
    def key(self) -> str:
        return f"{self.track}/{self.level}/{self.area}"


# ---------------------------------------------------------------------------
# Corpus analysis
# ---------------------------------------------------------------------------

def find_gaps(corpus_path: Path = CORPUS_PATH, target: int = TARGET_PER_CELL) -> list[GapCell]:
    """Identify all cells with fewer than `target` questions.

    Only cells that already hold at least one question are counted, so a
    completely empty cell never shows up as a gap.
    """
    with open(corpus_path, encoding="utf-8") as f:
        corpus = json.load(f)

    cells = Counter()
    for q in corpus:
        track = q.get("track", "")
        level = q.get("level", "")
        area = q.get("competency_area", "")
        if track and track != "global" and level and area:
            cells[(track, level, area)] += 1

    gaps = []
    for (track, level, area), count in sorted(cells.items()):
        if count < target:
            gaps.append(GapCell(
                track=track,
                level=level,
                area=area,
                current_count=count,
                needed=target - count,
            ))
    return gaps


# ---------------------------------------------------------------------------
# Prompt construction
# ---------------------------------------------------------------------------

def _load_hardware_context() -> str:
    """Load NUMBERS.md for hardware constants."""
    if NUMBERS_PATH.exists():
        return NUMBERS_PATH.read_text(encoding="utf-8")
    return "(Hardware constants not available — use conservative estimates)"


def build_prompt(gap: GapCell, hardware_ctx: str) -> str:
    """Build a targeted generation prompt for one gap cell."""
    tags = AREA_TO_TAGS.get(gap.area, [gap.area])
    tag_str = ", ".join(tags)
    track_ctx = TRACK_CONTEXT.get(gap.track, "")
    bloom_desc = BLOOM_DESCRIPTORS.get(gap.level, "")

    # Pick a few suggested topic tags for variety
    import random
    suggested_tags = random.sample(tags, min(2, len(tags)))
    suggested_str = " or ".join(f"`{t}`" for t in suggested_tags)

    return f"""You are a world-class ML systems interview question writer for the StaffML platform.
Your questions are used by engineers preparing for Staff/Principal ML Systems Engineer roles.
Every question must be grounded in real hardware physics and quantitative reasoning.

## ABSOLUTE REQUIREMENTS
1. Every scenario must be REALISTIC — something that actually happens in production
2. Every napkin math section must use REAL hardware specs from the constants below
3. Every common mistake must be a REAL misconception that engineers actually have
4. Every calculation must be arithmetically correct — you will be verified
5. The question must test the SPECIFIC competency area: **{gap.area}**
6. Use topic tags from: {tag_str}

## HARDWARE CONSTANTS (Source of Truth)
{hardware_ctx}

## TRACK CONTEXT
{track_ctx}

## COGNITIVE LEVEL
**{gap.level}**: {bloom_desc}

## YOUR TASK
Generate exactly {gap.needed} interview question(s) for:
- **Track:** {gap.track}
- **Competency area:** {gap.area}
- **Target level:** {gap.level}
- **Suggested topic tags:** {suggested_str}

Each question must be DISTINCT — test different aspects of {gap.area} at the {gap.level} level.
Do NOT duplicate concepts already common in the corpus (roofline basics, simple memory calculations).
Focus on {gap.track}-specific scenarios that a {gap.level}-level engineer would face.

## OUTPUT FORMAT
Return a JSON array. Each object must have these fields:
```json
[
  {{
    "level": "{gap.level}",
    "title": "The [Evocative Title]",
    "topic": "kebab-case-topic-from-taxonomy",
    "track": "{gap.track}",
    "scenario": "The full interviewer question text...",
    "common_mistake": "What engineers typically get wrong and why...",
    "realistic_solution": "The correct answer with full explanation...",
    "napkin_math": "Step-by-step quantitative reasoning with real numbers...",
    "resources": []
  }}
]
```

IMPORTANT:
- `napkin_math` is REQUIRED for ALL levels
- Return ONLY valid JSON, no markdown wrapping, no ```json fences
- Each question must be self-contained and test {gap.area} specifically
"""


# ---------------------------------------------------------------------------
# LLM backends
# ---------------------------------------------------------------------------

def call_gemini(prompt: str, model: str = "gemini-3.1-pro-preview") -> str:
    """Call Gemini via the locally-authenticated CLI."""
    result = subprocess.run(
        ["gemini", "--model", model, "-"],
        input=prompt,
        capture_output=True,
        text=True,
        timeout=180,
    )
    if result.returncode != 0:
        raise RuntimeError(f"gemini CLI failed (exit {result.returncode}): {result.stderr[:500]}")
    return result.stdout


def call_opus(prompt: str, api_key: str) -> str:
    """Call Claude Opus 4.6 via the Anthropic API."""
    try:
        import anthropic
    except ImportError:
        print("[ERROR] pip install anthropic required for Opus backend", file=sys.stderr)
        sys.exit(1)

    client = anthropic.Anthropic(api_key=api_key)
    message = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text


# ---------------------------------------------------------------------------
# Parsing LLM output
# ---------------------------------------------------------------------------

def parse_llm_output(raw: str) -> list[dict]:
    """Parse LLM response into a list of question dicts."""
    text = raw.strip()

    # Strip markdown fences if present
    if text.startswith("```"):
        # Drop the whole opening fence line; a fence with no newline leaves nothing
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.endswith("```"):
        text = text[:-3].strip()
    # Also strip a leftover "json" language tag
    if text.startswith("json\n"):
        text = text[4:].strip()

    # Try to find JSON array in the output
    start = text.find("[")
    end = text.rfind("]")
    if start != -1 and end > start:
        text = text[start:end + 1]

    try:
        questions = json.loads(text)
    except json.JSONDecodeError as e:
        print(f" [WARN] JSON parse failed: {e}", file=sys.stderr)
        print(f" [DEBUG] First 300 chars: {text[:300]}", file=sys.stderr)
        return []

    if isinstance(questions, dict):
        questions = [questions]
    if not isinstance(questions, list):
        return []

    # Keep only JSON objects; stray strings or numbers are not questions
    return [q for q in questions if isinstance(q, dict)]


# ---------------------------------------------------------------------------
# Markdown rendering (matches render.py format exactly)
# ---------------------------------------------------------------------------

def render_question_md(q: dict) -> str:
    """Render a question dict to the exact markdown <details> format."""
    level = q.get("level", "L3")
    meta = LEVEL_META.get(level, LEVEL_META["L3"])
    badge_label = meta["label"]
    badge_color = meta["color"]
    badge_alt = meta["alt"]

    title = q.get("title", "Untitled")
    topic = q.get("topic", "unknown")
    scenario = q.get("scenario", "")
    common_mistake = q.get("common_mistake", "")
    realistic_solution = q.get("realistic_solution", "")
    napkin_math = q.get("napkin_math", "")
    resources = q.get("resources") or []

    topic_tags = " ".join(f"<code>{t.strip()}</code>" for t in topic.split(","))

    inner = []
    inner.append(f" **Common Mistake:** {common_mistake}")
    inner.append("")
    inner.append(f" **Realistic Solution:** {realistic_solution}")

    if napkin_math:
        inner.append("")
        inner.append(f" > **Napkin Math:** {napkin_math}")

    if resources:
        inner.append("")
        inner.append(" 📖 **Resources:**")
        for r in resources:
            name = (r.get("name") or "").strip()
            url = (r.get("url") or "").strip()
            if name and url:
                inner.append(f" - [{name}]({url})")

    inner_content = "\n".join(inner)

    return f"""<details>
<summary><b><img src="https://img.shields.io/badge/Level-{badge_label}-{badge_color}?style=flat-square" alt="{badge_alt}" align="center"> {title}</b> · {topic_tags}</summary>

- **Interviewer:** "{scenario}"

<details>
<summary><b>🔍 Reveal Answer</b></summary>

{inner_content}
</details>
</details>"""


# ---------------------------------------------------------------------------
# File appending
# ---------------------------------------------------------------------------

def get_target_file(gap: GapCell) -> Path:
    """Determine which markdown file to append to for a given gap."""
    track_map = TRACK_FILE_MAP.get(gap.track, {})
    filename = track_map.get(gap.area, "01_" + gap.track + ".md")
    return BASE_DIR / gap.track / filename


def append_to_markdown(gap: GapCell, questions: list[dict]) -> Path:
    """Append rendered questions to the target markdown file."""
    target = get_target_file(gap)
    if not target.exists():
        print(f" [WARN] Target file not found: {target}", file=sys.stderr)
        return target

    md_blocks = []
    for q in questions:
        md_blocks.append(render_question_md(q))

    separator = f"\n\n<!-- === Generated: {gap.key} === -->\n\n"
    content = separator + "\n\n".join(md_blocks) + "\n"

    with open(target, "a", encoding="utf-8") as f:
        f.write(content)

    return target


# ---------------------------------------------------------------------------
# Worker function for parallel generation
# ---------------------------------------------------------------------------

def process_gap(
    gap: GapCell,
    hardware_ctx: str,
    backend: str,
    api_key: Optional[str],
    model: str,
) -> dict:
    """Generate questions for one gap cell. Returns a result dict."""
    result = {
        "cell": gap.key,
        "needed": gap.needed,
        "generated": 0,
        "questions": [],
        "file": str(get_target_file(gap)),
        "error": None,
    }

    try:
        prompt = build_prompt(gap, hardware_ctx)

        if backend == "opus":
            raw = call_opus(prompt, api_key)
        else:
            raw = call_gemini(prompt, model=model)

        questions = parse_llm_output(raw)

        if not questions:
            result["error"] = "No questions parsed from LLM output"
            return result

        # Ensure each question has the right track/level
        for q in questions:
            q["track"] = gap.track
            q["level"] = gap.level
            q.setdefault("competency_area", gap.area)

        # Append to markdown
        target = append_to_markdown(gap, questions)
        result["generated"] = len(questions)
        result["questions"] = questions
        result["file"] = str(target)

        print(f" [OK] {gap.key}: generated {len(questions)}/{gap.needed} -> {target.name}")

    except Exception as e:
        result["error"] = str(e)
        print(f" [ERR] {gap.key}: {e}", file=sys.stderr)

    return result


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(description="Fill underfilled StaffML corpus cells")
    parser.add_argument("--dry-run", action="store_true", help="Show plan without generating")
    parser.add_argument("--workers", type=int, default=8, help="Parallel workers (default: 8)")
    parser.add_argument("--model", choices=["flash", "opus"], default="flash",
                        help="LLM backend: flash=Gemini CLI (model set by --gemini-model), "
                             "opus=Claude Opus 4.6")
    parser.add_argument("--gemini-model", default="gemini-3.1-pro-preview",
                        help="Specific Gemini model name (default: gemini-3.1-pro-preview)")
    parser.add_argument("--target", type=int, default=TARGET_PER_CELL,
                        help="Target questions per cell (default: 3)")
    args = parser.parse_args()

    target = args.target

    # --- Find gaps ---
    print("=" * 60)
    print("StaffML Gap-Fill Generator")
    print("=" * 60)

    gaps = find_gaps(target=target)
    total_needed = sum(g.needed for g in gaps)

    # Human-readable backend label for the plan summary
    if args.model == "opus":
        backend_label = "Claude Opus 4.6 (Anthropic API)"
    else:
        backend_label = f"Gemini ({args.gemini_model})"

    print(f"\nCorpus: {CORPUS_PATH}")
    print(f"Underfilled cells: {len(gaps)}")
    print(f"Total questions to generate: {total_needed}")
    print(f"Backend: {backend_label}")
    print(f"Workers: {args.workers}")
    print()

    # --- Breakdown by track ---
    by_track = defaultdict(list)
    for g in gaps:
        by_track[g.track].append(g)

    for track in sorted(by_track):
        track_gaps = by_track[track]
        track_needed = sum(g.needed for g in track_gaps)
        print(f" {track}: {len(track_gaps)} cells, {track_needed} questions needed")
        for g in track_gaps:
            print(f" {g.level}/{g.area}: {g.current_count} -> {target} (need {g.needed})")
        print()

    # --- Estimate time ---
    # Rough guess: ~15s per Gemini call, ~30s per Opus call
    time_per_call = 30 if args.model == "opus" else 15
    batches = (len(gaps) + args.workers - 1) // args.workers
    est_seconds = batches * time_per_call
    est_minutes = est_seconds / 60
    print(f"Estimated time: ~{est_minutes:.1f} minutes ({batches} batches x {time_per_call}s)")
    print()

    if args.dry_run:
        print("[DRY RUN] No questions generated.")
        return

    # --- Validate backend ---
    api_key = None
    if args.model == "opus":
        api_key = os.environ.get("ANTHROPIC_API_KEY")
        if not api_key:
            print("[ERROR] ANTHROPIC_API_KEY not set. Required for Opus backend.", file=sys.stderr)
            sys.exit(1)

    # --- Load hardware context ---
    hardware_ctx = _load_hardware_context()

    # --- Generate in parallel ---
    print("Generating...")
    start_time = time.time()
    all_results = []

    with ThreadPoolExecutor(max_workers=args.workers) as pool:
        futures = {
            pool.submit(
                process_gap, gap, hardware_ctx, args.model, api_key, args.gemini_model
            ): gap
            for gap in gaps
        }

        for future in as_completed(futures):
            result = future.result()
            all_results.append(result)

    elapsed = time.time() - start_time

    # --- Save JSON output ---
    with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
        json.dump(all_results, f, indent=2)

    # --- Summary ---
    total_generated = sum(r["generated"] for r in all_results)
    errors = [r for r in all_results if r["error"]]

    print()
    print("=" * 60)
    print("RESULTS")
    print("=" * 60)
    print(f" Cells processed: {len(all_results)}")
    print(f" Questions generated: {total_generated}/{total_needed}")
    print(f" Errors: {len(errors)}")
    print(f" Time: {elapsed:.1f}s")
    print(f" Output: {OUTPUT_JSON}")
    print()

    if errors:
        print("ERRORS:")
        for r in errors:
            print(f" {r['cell']}: {r['error']}")
        print()

    # --- Files modified ---
    modified = {r["file"] for r in all_results if r["generated"] > 0}
    if modified:
        print("Files modified:")
        for f in sorted(modified):
            print(f" {f}")
        print()
        print("Next steps:")
        print(" 1. Review the generated questions in each file")
        print(" 2. Run: python3 build_corpus.py (rebuild corpus.json)")
        print(" 3. Verify: python3 -m engine validate")


if __name__ == "__main__":
    main()