8 Commits

Vijay Janapa Reddi
2b381bb949 refactor(vault-cli): rename --legacy-json to --local-json
The flag is the StaffML frontend's local-dev fallback (read corpus.json
from disk via NEXT_PUBLIC_VAULT_FALLBACK=static), not a deprecated path.
"Legacy" implied "soon to be removed"; "local-json" describes its actual
role and reads correctly in scripts and docs.

- vault-cli: rename CLI flag, parameter, result key, and help text.
- CI workflows + pre-commit config: invoke the new flag name.
- All scripts that print the command (suggest_exemplars,
  pre_commit_corpus_guard, promote_validated, rename_legacy_ids,
  export_to_staffml, the paper analyze_corpus/generate_*) updated.
- Comments and docs (ARCHITECTURE, CHANGELOG, REVIEWS, TESTING,
  MASSIVE_BUILD_RUNBOOK, DEPRECATED, AUTHORING, plus frontend
  comments and .env.example / .gitignore) updated.

The "legacy_json" sentinel string in corpus_stats.json._meta.source
is intentionally NOT renamed — it is a stable artifact format read
by downstream paper-generation tooling.
2026-04-30 09:30:28 -04:00
Vijay Janapa Reddi
542aaf95d2 cleanup(vault): release-ready Phase A — schema hardening + lint calibration + chain repair
Closes the cleanup arc (A.1–A.10 in RESUME_PLAN_RELEASE.md). Every
gate is now green: vault check --strict, vault lint, vault doctor,
vault codegen --check, staffml validate-vault, Playwright (9/9), tsc.

A.1 mobile-1962.svg: renamed `Edge` → `RegEdge` in graphviz source
    (`Edge` is a reserved keyword); SVG renders cleanly. Also fixed
    tinyml-1570.py (missing `import numpy as np`) which the new failure
    log surfaced.

A.2 render_visuals.py: structured per-ID failure log written to
    `_validation_results/render_failures.json` on every run; non-zero
    exit on any per-item crash; new `--fail-fast` and `--failure-log`
    CLI options. Replaces the prior silent-failure mode.

A.3 LinkML visual schema: typed as a structured sub-schema. New
    `VisualKind` enum (svg only — `mermaid` was reserved but never
    shipped, dropped to keep the enum honest). Path regex tightened
    to `^[a-z0-9-]+\.svg$`. Alt minimum length 10, caption required
    minimum length 5. TypeScript Visual interface + Question.visual
    field added to staffml-vault-types/index.ts.

A.4 Pydantic Visual + Question validators:
    - Visual.kind hard-rejects anything but `svg`
    - Visual.path enforces the new regex
    - Visual.alt min 10 chars, caption required min 5 chars
    - Question.model_validator: visual.path MUST resolve to a real
      file under interviews/vault/visuals/<track>/. Skipped in
      production deploys where the working tree is absent.
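A minimal sketch of those field checks, written as a plain function rather than the repo's actual Pydantic models (names and error messages here are illustrative; only the rules themselves come from A.3/A.4):

```python
import re

# Path regex from A.3: lowercase slug, .svg extension only.
SVG_PATH_RE = re.compile(r"^[a-z0-9-]+\.svg$")

def validate_visual(kind, path, alt, caption):
    """Mirror the A.4 rules: svg-only kind, tight path regex,
    alt >= 10 chars, caption required with >= 5 chars."""
    errors = []
    if kind != "svg":
        errors.append(f"kind must be 'svg', got {kind!r}")
    if not SVG_PATH_RE.match(path):
        errors.append(f"path {path!r} does not match ^[a-z0-9-]+\\.svg$")
    if len(alt) < 10:
        errors.append("alt must be at least 10 characters")
    if caption is None or len(caption) < 5:
        errors.append("caption is required, minimum 5 characters")
    return errors
```

A clean visual yields an empty list; a draft with a non-svg kind, an uppercase or underscore path, a short alt, and a missing caption trips all four rules.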

A.5 Registry repair + doctor split:
    - tools: repair_registry.py appended 5,269 missing IDs
      (the rename refactor at 8a5c3ff3c left the append-only registry
      unsynced; this brings disk-coverage to 100%). Header block in
      id-registry.yaml documents the rebuild rationale.
    - doctor.py: split symmetric `registry-integrity` check into
      `disk-coverage` (HARD FAIL if any disk YAML id is unregistered)
      and `registry-history` (INFO ONLY for retired ids — the registry
      is by design an audit log, retired ids are normal). Pre-existing
      `_check_schema_version` bug (`versions == {1}` vs string `"1.0"`)
      fixed.
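The disk-coverage / registry-history split reduces to set arithmetic over the two ID populations; a sketch of that shape (function and key names are illustrative, not doctor.py's actual API):

```python
def doctor_registry_checks(disk_ids, registry_ids):
    """A.5 split: disk-coverage is a hard gate (every YAML on disk
    must be registered); registry-history is informational, since the
    append-only registry legitimately retains retired ids."""
    unregistered = disk_ids - registry_ids   # on disk, missing from registry
    retired = registry_ids - disk_ids        # registered, no longer on disk
    return {
        "disk-coverage": {
            "status": "FAIL" if unregistered else "PASS",
            "ids": sorted(unregistered),
        },
        "registry-history": {"status": "INFO", "ids": sorted(retired)},
    }
```

Any unregistered disk ID fails the run; retired IDs only annotate it.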

A.6 Lint calibration via 4-expert consensus + bloom-canonical
    reclassification:
    - Spawned 4 experts (Vijay Reddi, Chip Huyen, Jeff Dean,
      education-reviewer) on 42 disputed (zone, level) pairs;
      consensus-builder aggregated to 15 valid / 19 invalid / 8
      borderline.
    - User arbitrated 8 borderlines: 7 widen / 1 reclassify.
    - Built ZONE_BLOOM_AFFINITY matrix (Education-Reviewer's idea):
      every zone admits its dominant Bloom verb + adjacent verbs,
      rejects clear hierarchy violations.
    - reclassify_zone_bloom_mismatch.py applied 576 deterministic
      zone fixes via BLOOM_CANONICAL_ZONE mapping (e.g. fluency+analyze
      → analyze, recall+analyze → analyze, evaluation+apply → implement).
    - Question.model_validator(_zone_bloom_compatible): hard-rejects
      future zone-bloom mismatches at write time. Generated drafts
      can no longer ship a self-contradicting classification.
    - ZONE_LEVEL_AFFINITY widened per consensus + arbitration +
      post-reclassification adjustments. Lint warnings: 1,308 → 0.
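The deterministic fix step amounts to a lookup table keyed on (zone, bloom). An illustrative fragment — only the three pairs quoted above are from the commit; the full table lives in reclassify_zone_bloom_mismatch.py:

```python
# Illustrative subset of the mapping; real table is much larger.
BLOOM_CANONICAL_ZONE = {
    ("fluency", "analyze"): "analyze",
    ("recall", "analyze"): "analyze",
    ("evaluation", "apply"): "implement",
}

def canonical_zone(zone, bloom):
    """Return the corrected zone for a known (zone, bloom) mismatch,
    or the original zone when the pair is not in the table."""
    return BLOOM_CANONICAL_ZONE.get((zone, bloom), zone)
```

Pairs outside the table pass through unchanged, which is what makes the 576-item rewrite safe to re-run.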

A.7 Chain integrity:
    - repair_chains.py: drops chain refs when a chain has <2 published
      members (chain ceases to exist), renumbers all members of any
      chain whose positions are non-sequential / duplicated /
      non-monotonic-by-level. Sort key: level ascending, then old
      position, then qid (deterministic).
    - validate-vault.py: relaxed sequential check to unique-positions
      check. Position gaps from mid-chain deletions are normal; what
      matters is uniqueness + bloom-monotonicity (vault check --strict
      enforces both from YAML source-of-truth).
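The drop-and-renumber rule above can be sketched in a few lines (member shape and field names are illustrative; the sort key is the one stated in the commit):

```python
def repair_chain(members):
    """members: dicts with qid, level, position. Chains with fewer
    than 2 published members are dropped entirely; otherwise members
    are renumbered 1..N by the deterministic key
    (level ascending, old position, qid)."""
    if len(members) < 2:
        return []  # chain ceases to exist; callers drop the refs
    ordered = sorted(members, key=lambda m: (m["level"], m["position"], m["qid"]))
    return [{**m, "position": i} for i, m in enumerate(ordered, start=1)]
```

Duplicate or gapped positions come out contiguous, and level-monotonicity is restored by the primary sort key.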

A.8 Practice page visual + zoom modal:
    - QuestionVisual.tsx: wraps the `<img>` in `<Zoom>` from
      react-medium-image-zoom (4 KB). Click image → fullscreen
      `<dialog data-rmiz-modal>`; ESC closes. Added test-id
      `question-visual-img` for stable selector.
    - New Playwright test: 9th in the suite, deep-links cloud-4492,
      asserts the dialog opens on click and closes on ESC.
    - TypeScript: removed `mermaid` from local Visual types in
      corpus.ts and corpus-vault.ts; tsc clean.

A.9 All gates green:
    - vault check --strict: 0 errors / 0 invariant failures
    - vault lint: 0 errors / 0 warnings (was 1,308 warnings)
    - vault codegen --check: artifacts in sync (hash baseline updated)
    - vault doctor: 0 fails (registry-history is info-only; git-state
      warns about the not-yet-committed working tree for this commit)
    - staffml validate-vault: 0 errors / 0 warnings, deployment-ready
    - Playwright: 9/9 pass (was 8; +zoom modal test)
    - render_visuals: 0 errors (was 2 silent failures pre-A.2)
    - tsc: clean

Distribution after reclassification: 9,544 published unchanged;
576 items moved zone via bloom-canonical mapping (full per-item
report at /tmp/reclassify_changes.csv). Chain count 879 → 850
after orphan-singleton drops. release_hash updated.

Carry-forward to next session (Phase B):
- Priority gap closure for parallelism cells + global L4-L6+
  (the run that produced this corpus did not close the targeted
  cells; B.3 needs specialized prompts per cell-class)
- 120 NEEDS_FIX items from coverage_loop/20260425_150712/ still
  carry judge fix_suggestions; spawn fix-agent in Phase C
2026-04-25 15:12:51 -04:00
Vijay Janapa Reddi
ece6eccf23 feat(vault): massive build — 630 drafts generated, 320 PASS promoted, paper 0.1.1
Phase 1 (analyzer):  top-priority cells: tinyml/parallelism (0/90),
                     tinyml/networking (2/90), mobile/parallelism (0/127),
                     edge/parallelism (12/152), global/L4-L6+ deeply empty.
Phase 2 (loop):      6 iterations, 50 of 80 API calls used, 630 drafts
                     generated (52% PASS / 19% NEEDS_FIX / 26% DROP /
                     ~6% unjudged). Saturation reason: same top-priority
                     cell two iterations in a row — converged. Top-priority
                     decay 2.25 → 2.14 → 2.03 → 1.93 → 1.83 plateaued;
                     generator cannot meaningfully shrink
                     tinyml/specification/L6+ further within current
                     prompt framing. Both halt conditions (gap-threshold
                     0.8, max-calls 80) had headroom; structural
                     convergence fired first. Loop defaults bumped:
                     max-iters 20 → 30, max-calls 60 → 80, batch 12 → 30,
                     calls/iter 3 → 4, judge chunk 15 → 25.
Phase 3 (quality):   Spot-read 4 PASS items + visuals across cloud/edge/
                     mobile/tinyml. All technically sound, math correct,
                     real hardware grounding (MI300X, Jetson Orin,
                     Cortex-M4 BLE), SVGs follow svg-style.md palette.
                     Systemic finding: generator emitted 462 drafts with
                     malformed competency_area values (60 distinct
                     patterns: zones-as-area, bloom-verbs-as-area,
                     underscore hallucinations, dash-form/slash-form
                     concatenations). Resolved by extending
                     fix_competency_areas.py REMAP table; re-run cleanup
                     mapped all 462 to canonical. Root cause —
                     generator skips Pydantic validation at write time —
                     flagged for follow-on fix; not blocking.
Phase 4 (promote):   320 PASS items promoted; bundle 9,224 → 9,544
                     published (exactly +320). Visual assets: 234 in
                     bundle, mirrored to staffml/public/.
Phase 5 (paper):     Cut 0.1.1 release (patch bump: content addition,
                     no schema change). release_hash 0350da5706e6.
                     macros.tex regenerated to 9,544/87 topics/
                     13 areas/11 zones; 4 figures rebuilt; paper.tex
                     zone counts updated (1,583/1,227/1,113 →
                     1,615/1,256/1,144). PDF compiles to 25 pages,
                     no LaTeX errors (citation warnings pre-existing).
Phase 6 (GUI):       All 8 Playwright tests pass on fresh dev server.
                     /practice HTML contains zero malformed area names
                     (down from 60 distinct pre-fix).
Phase 7 (manifest):  vault-manifest.json refreshed: questionCount
                     9224 → 9544, contentHash 539eb877f9cc → 0350da5706e6,
                     track + level distributions updated to match
                     0.1.1 corpus.

Loop run dir: interviews/vault/_validation_results/coverage_loop/20260425_150712
Deferred queue (next session): 120 NEEDS_FIX items carrying judge
fix_suggestions + 165 DROP items, plus the generator validate-at-write fix.

The runbook (vault/docs/MASSIVE_BUILD_RUNBOOK.md) is the methodology
this session followed; can be re-run on any future generation day.
2026-04-25 13:15:41 -04:00
Vijay Janapa Reddi
0afc384282 feat(vault): LLM-as-judge validator + iterative coverage loop
Two new pieces close the generation→validation→saturation feedback loop:

1. gemini_cli_llm_judge.py — multi-criteria validator. For each draft,
   judges math correctness, cell-fit (does it actually target the
   declared track/zone/level?), scenario realism, uniqueness vs canonical
   questions, and visual-asset alignment. Returns PASS/NEEDS_FIX/DROP
   per item. Batched (default 15 per call) for budget efficiency.
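One plausible roll-up from per-criterion results to the three verdicts — the aggregation rule below is an assumption for illustration, not the judge script's actual logic:

```python
def verdict(criteria):
    """criteria: dict of criterion name -> 'pass' | 'fix' | 'fail'.
    Assumed roll-up: any hard failure drops the draft; any fixable
    issue demotes it to NEEDS_FIX; otherwise it passes."""
    results = set(criteria.values())
    if "fail" in results:
        return "DROP"
    if "fix" in results:
        return "NEEDS_FIX"
    return "PASS"
```

Under this rule a single failed criterion (e.g. wrong math) outweighs any number of passes, matching the PASS/NEEDS_FIX/DROP split reported per item.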

2. iterate_coverage_loop.py — drives the full loop:
   analyze → plan → generate → render → judge → apply → re-analyze.
   Self-paced: stops when (a) top priority gap drops below threshold,
   (b) DROP rate exceeds the saturation/hallucination threshold,
   (c) total API calls exceed budget, or (d) the same cell is top
   priority for two iterations in a row (convergence). The user no
   longer specifies "how many questions" — the loop generates until
   the corpus reaches a measurable steady state.
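The four stop conditions can be sketched as a single predicate; the function name, signature, and default thresholds below are illustrative (the 0.8 gap threshold and 80-call budget are the defaults quoted elsewhere in this log):

```python
def should_halt(gap_history, drop_rate, calls_used,
                gap_threshold=0.8, drop_threshold=0.5, max_calls=80):
    """gap_history: list of (top_cell, gap_score) per iteration.
    Returns the halt reason, or None to keep iterating."""
    if gap_history and gap_history[-1][1] < gap_threshold:
        return "gap-closed"        # (a) top priority gap below threshold
    if drop_rate > drop_threshold:
        return "saturation"        # (b) judge DROP rate too high
    if calls_used >= max_calls:
        return "budget"            # (c) API-call budget exhausted
    if len(gap_history) >= 2 and gap_history[-1][0] == gap_history[-2][0]:
        return "convergence"       # (d) same cell on top twice in a row
    return None
```

In the massive-build run described above, conditions (a)-(c) all had headroom and (d) fired first.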

Plus 25 round-1 visual questions generated by the new batched generator
(5 batched calls × 5 cells each, zero failures).

The loop is the answer to "we need balance, not just volume": every
iteration's plan derives from a fresh analysis of where coverage is
weakest, so generation can never over-fill an already-saturated cell.
2026-04-25 09:18:32 -04:00
Vijay Janapa Reddi
612885a952 refactor(vault): visual schema aligns with website + 5 more Gemini-generated visuals
Schema fix: visual.kind is always 'svg' (the format the website ships) and
visual.path points to that asset. The build-pipeline format is recorded as
optional metadata in visual.source_format ('dot' | 'matplotlib' | 'hand'),
which the website ignores. This separates "what users render" from "how
maintainers built it".

Source files live next to the SVG by naming convention; the renderer infers
the path from the YAML's source_format hint without a dedicated source field.
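The naming-convention inference might look like the following; the commit only states that sources sit next to the SVG, so the extension mapping here is an assumption for illustration:

```python
from pathlib import PurePosixPath

# Assumed extensions per source_format; 'hand' SVGs have no source file.
SOURCE_EXT = {"dot": ".dot", "matplotlib": ".py"}

def infer_source_path(svg_path, source_format):
    """Derive the sibling source path from the shipped SVG path
    and the optional source_format hint."""
    ext = SOURCE_EXT.get(source_format)
    if ext is None:
        return None
    return str(PurePosixPath(svg_path).with_suffix(ext))
```

Because the website only reads visual.path, this convention stays invisible to users while keeping the build reproducible.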

Five new visual exemplars generated by Gemini 3.1 Pro Preview, covering
diverse archetypes:
- cloud-2849 (DOT): incast-bottleneck topology
- cloud-2850 (DOT): leaf-spine fabric with 2:1 oversubscription
- cloud-2851 (matplotlib): bandwidth bar chart for data pipeline diagnosis
- cloud-2852 (matplotlib): checkpoint/recovery timeline with RPO/RTO
- edge-0972 (matplotlib): Poisson vs bursty queueing curves

Plus the four prior exemplars (cloud-2846, 2847, 2848, tinyml-0816)
re-emitted under the new schema. cloud-visual-001 unchanged — already had
the correct shape.

ARCHITECTURE.md rewritten to document the simpler three-layer separation
(website / build / authoring).
2026-04-25 08:57:26 -04:00
Vijay Janapa Reddi
f435185671 feat(vault): Gemini 3.1 Pro question generator with optional visual archetypes
gemini_cli_generate_questions.py mirrors gemini_cli_math_review.py's design:
review-first, JSON-strict, model pinned to gemini-3.1-pro-preview with a hard
guard against override. Targets weak coverage cells from the portfolio
balance loop or explicit --target track:topic:zone:level cells.

For visual-eligible topics (the 10 archetypes in audit_visual_questions.py),
the generator also produces the diagram source artifact (DOT or matplotlib
script) which render_visuals.py converts to a ship-ready SVG. This closes
the generation→render→validate loop using two different model passes:
Gemini drafts; the math review verifies.

First generated example: tinyml-0816 (wake-word duty-cycle evaluation) with
a matplotlib power-timeline visual. Math review returned CORRECT on the
first call. Status remains draft pending broader cross-validation.
2026-04-25 08:47:41 -04:00
Vijay Janapa Reddi
38e5c99f17 feat(vault): multi-format visual question architecture (DOT + matplotlib + SVG)
ARCHITECTURE.md establishes that visuals are a property of any question, not
a separate category. Three supported formats let the layout engine do the
work: DOT for graph topology, matplotlib for curves and Gantt charts, hand
SVG for custom layouts.

render_visuals.py is the single entry point that dispatches by visual.kind,
runs the appropriate tool, and normalizes the rendered SVG to the book's
font stack. It is idempotent and supports --dry-run.
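A sketch of that dispatch-by-kind entry point, with hypothetical stand-ins for the real tool invocations (the command strings and dict shape here are illustrative only):

```python
def render_visual(visual, dry_run=False):
    """Dispatch a visual to its renderer by kind and return the
    command that would run; a real implementation would execute it
    and then normalize the SVG's font stack."""
    renderers = {
        "dot": lambda v: f"dot -Tsvg {v['source']}",
        "matplotlib": lambda v: f"python {v['source']}",
        "svg": lambda v: f"copy {v['source']}",  # hand SVG passes through
    }
    try:
        command = renderers[visual["kind"]](visual)
    except KeyError:
        raise ValueError(f"unsupported visual kind: {visual['kind']!r}")
    return f"[dry-run] {command}" if dry_run else command
```

Keeping the dispatch table pure (it only builds commands) is what makes --dry-run and idempotent re-runs cheap to support.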

Three exemplars cover the three formats:
- cloud-2846 (DOT): Tree AllReduce on 8 ranks — auto-laid-out topology
- cloud-2847 (matplotlib): Queueing hockey-stick curve with SLO line
- cloud-2848 (matplotlib): Pipeline-bubble Gantt for GPipe schedule

All three are status:draft pending math review and promotion in a later
batch. Existing cloud-visual-001 remains unchanged as the canonical
hand-SVG exemplar.
2026-04-25 08:42:59 -04:00
Vijay Janapa Reddi
1898fe8c9a feat(vault): add first visual-question exemplar + authoring guide
Seeds the visuals/ directory with a reference pattern so future
authors have a concrete template to clone.

Exemplar: Ring AllReduce on 4 ranks (cloud track, L3, apply/analyze).
- SVG follows .claude/rules/svg-style.md: 680×460 viewBox, Helvetica
  Neue, compute-blue ranks, orthogonal ring arrows, 10-px grid.
- YAML wires the visual block (kind=svg, path=cloud-visual-001.svg,
  alt + caption) and pairs it with a matching question: 'Using the
  diagram, calculate the total time to complete the full AllReduce.'
- The realistic_solution walks through 2(N−1)/N × data / bw and
  explains the common failure mode (forgetting the all-gather phase).
  Napkin math shows the step-time decomposition.
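That napkin math is easy to reproduce; the formula is the one in the exemplar, while the concrete numbers below are illustrative:

```python
def ring_allreduce_time(n_ranks, data_bytes, bw_bytes_per_s):
    """Bandwidth-bound Ring AllReduce: each rank moves 2(N-1)/N of the
    data over the wire (reduce-scatter phase plus all-gather phase —
    forgetting the all-gather is the common failure mode)."""
    bytes_on_wire = 2 * (n_ranks - 1) / n_ranks * data_bytes
    return bytes_on_wire / bw_bytes_per_s

# Example: 4 ranks, 4 GB of gradients, 12 GB/s links
# -> 2 * 3/4 * 4e9 / 12e9 = 0.5 s
t = ring_allreduce_time(4, 4e9, 12e9)
```

Dropping the all-gather phase (using (N−1)/N instead of 2(N−1)/N) would halve the answer, which is exactly the mistake the realistic_solution warns about.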

AUTHORING.md: the when/how/why guide for future visual questions.
- When a visual earns its place — three criteria (ask requires the
  diagram, encodes info text cannot, static suffices).
- High-value candidate topics — ring/tree AllReduce, roofline, KV
  cache, pipeline bubbles, memory hierarchy, MCU memory maps,
  systolic arrays, attention, MoE.
- Step-by-step authoring workflow pointing at the book's SVG style
  guide; readers already know the visual vocabulary from the book,
  so consistency transfers.
- Accessibility requirements (non-negotiable): alt is enforced by
  the Pydantic schema, colour never the sole semantic channel, text
  in <text> elements (not paths), WCAG AA contrast.
- Explicit anti-patterns: no inline SVG in YAML, no mermaid for
  non-graph content, no decorative effects, no label duplication of
  scenario prose.
2026-04-24 16:10:54 -04:00