mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-07 18:18:42 -05:00

Files

Vijay Janapa Reddi 6b2b3e0542 feat(vault): Phase D + F — parallelism gap closure (+87 PASS items)

Closes the parallelism + global L4-L6+ gaps that have been open across
three prior pushes. All gates green: vault check, lint, doctor, codegen,
validate-vault, render. Bundle: 9,688 → 9,775 published.

PARALLELISM GAP — finally closed:
  tinyml/parallelism:  1 → 8
  mobile/parallelism:  0 → 6
  edge/parallelism:   13 → 18
  global/parallelism:  0 → 19
  cloud/parallelism:  326 (unchanged; was already dense)

Phase D — parallelism + global generation (87 PASS):
D.1 Hand-authored 72 parallelism cells (track × parallelism-topic ×
    zone × level for edge/mobile/tinyml at L4-L6+) + 10 global L4-L6+
    cells. Bypasses the analyzer's topic-priority ranking which never
    surfaced parallelism cells in the top-100. Saved to
    tools/phase_d/{parallelism_targets.txt,global_targets.txt}.
D.2 PARALLELISM_RULES prompt variant in gemini_cli_generate_questions.py
    + --prompt-variant {default,parallelism} CLI flag. Adds rules:
      - FORBID single-step bandwidth division ("payload / bandwidth")
      - REQUIRE concrete interconnect (NVLink/IB/PCIe/RoCE/LoRa/SPI/BLE
        appropriate to track)
      - REQUIRE quantified synchronization or pipeline-bubble cost
      - REQUIRE non-obvious failure mode in common_mistake
      - For tinyml: ground in real numbers (Cortex-M4 SPI 5-25 MHz,
        LoRa 5-50 kbps)
    + --targets-from <file> CLI flag for hand-authored target lists.
    + parse_target() now sets competency_area from TOPIC_TO_AREA
      mapping (was hardcoded to "cross-cutting").
D.3 Generator: 72/72 written, **0 validate-at-write failures**, 3 API
    calls (no retries needed). Judge: 58 PASS / 12 NEEDS_FIX / 2 DROP
    = **80.6% pass rate** (vs B.5's 51% on standard cells). PARALLELISM
    prompt + validate-at-write together drove the rate up by 30pts.
D.4 Spot-read: 16 stratified PASS items (ran out at 16, no cloud since
    D.1 skipped that track). 0% rejection rate, all show real topology
    + quantified sync cost + correct math.
D.5 Global generator: 10/10 written, 0 validate failures, 1 API call.
    Judge: 6 PASS / 3 NEEDS_FIX / 1 DROP = 60% pass rate. Filled
    global cells (global-0432..0441).
D.6 Promote, rebuild bundle, repair registry, update manifest.

Phase E.1 — retry-on-validation-fail in generator:
  Single retry with structured error context for validate-at-write
  rejections. Cap at 1 retry per batch. NOT triggered in this run
  (D.3 + D.5 had 0 failures), but in place for future runs that
  might face the iter-1/iter-3 zero-draft pattern from B.5.

Phase F — second-pass NEEDS_FIX/DROP rehab (23 PASS):
F.2 Spawned general-purpose fix-agent on 33 items (13 NEEDS_FIX + 20
    DROP from C.3's first re-judge). 33/33 rewritten with deeper
    revisions: visual-aligned reframings, math corrections, real
    track-specific toolchains (Hailo-8 DFC, TensorRT 8.6 calibrators,
    Cortex-X4 NEON SDOT vs Hexagon NPU), unrealistic-premise fixes
    (KV cache in NPU SRAM → tiered LPDDR5/TCM scheme).
F.1 Re-judge: 23 PASS / 4 NEEDS_FIX / 6 DROP = **69.7% pass rate** on
    items previously rated NEEDS_FIX or DROP. The fix-agent's deeper
    rewrites recovered 70% of the carry-forward queue.
F.3 Stratified spot-read of 16 PASS items (parallel-safe with F.1):
    0% rejection rate. Standout: tinyml-1817 correctly diagnoses 2x
    half-duplex UART penalty by comparing observed to theoretical Ring
    AllReduce time.

Cleanup:
- repair_registry.py: appended 87 new IDs (D.3 + D.5 + F.1 outputs).
- vault-manifest.json refreshed: 9,688 → 9,775; track + level
  distributions updated; contentHash dccd3073672c.

API budget: ~12 calls used of 70 allotted (3 D.3 gen + 3 D.3 judge
+ 1 D.5 gen + 1 D.5 judge + 2 F.1 judge + 1 sample = 11). Far under
budget thanks to validate-at-write driving 0 retry calls.

The corpus is StaffML-day-ready with the parallelism gap genuinely
closed for the first time. The remaining 13 NEEDS_FIX + 6 DROP from
F.1 are deferred to a future cleanup; they don't block release.

2026-04-25 18:31:58 -04:00

13 KiB

Raw Permalink Blame History

Phase D F.2 — Second-Pass Fix Agent Report

Date: 2026-04-25

Totals

Manifest size: 33 items
Rewritten (substantive revision): 33
Archived: 0
Skipped: 0
All YAMLs validated cleanly via yaml.safe_load and required-field check
All competency_area values within the closed enum
All (zone, bloom_level) pairs within ZONE_BLOOM_AFFINITY
All visual blocks preserved (kind, path, alt, caption unchanged)
All status: draft preserved (no promotions)

Strategy notes

Visual block treated as immutable per task contract. When the diagnostic was visual-alignment (10 items), the fix path was to rewrite scenario / question / realistic_solution so the existing visual alt aligned, typically by framing the visual as the "naive baseline" being challenged in the rewritten question (edge-2357, edge-2367, edge-2364 secondary, mobile-1891, mobile-1903, mobile-1982, tinyml-1562) or by reshaping the scope of the question to match what the visual actually shows (mobile-1897 narrowed to uplink-only, mobile-1881 added NPU as a downstream stage so the Cloud->Phone->NPU alt fits, mobile-1896 reframed to use the visual's stacked bar as the "wrong answer" the student must correct).
For DROP items, applied substantive rewrites (new scenario or new angle on the same topic) to surface a viable testable concept.
For NEEDS_FIX items where the prior fix was insufficient, deepened the cycle-accurate / mathematical / hardware-specific content rather than cosmetic edits.

Per-item summary

edge-2357 (NEEDS_FIX, visual_alignment WARN)

Reframed scenario to start FROM the fragmentation visual as the naive view, then escalate the question to TLB pressure and page-walk overhead on Orin's ~512-entry iGPU MMU. Visual now reads as the launching point, not a contradiction.

edge-2364 (NEEDS_FIX, visual_alignment WARN)

Repositioned the saturated-switch visual as the actual binding constraint by specifying a 1 GbE TOR switch (4-camera aggregate of 480 MB/s saturates 125 MB/s switch 3.8x). Solution now layers in host-memory ingestion as the next constraint after a network upgrade — matches both the visual and the original diagnostic intent.

edge-2367 (DROP, visual_alignment ERROR)

Visual showed overlapping pipeline; solution is serial. Reframed question to explicitly use the overlapping-pipeline figure as the textbook upper-bound (66.6 FPS) being contrasted against the actual single-fabric multi-context serial result (36.5 FPS, gap = 45%).

edge-2390 (DROP, scenario_realism ERROR)

Template-injection glitch already cleaned in prior pass. Substantive rewrite: specified i.MX 8M Plus host paired with Hailo-8, computed raw-frame DRAM traffic (1.49 GB/s one-way, ~4.5 GB/s pipeline) vs the trivial 25 Mbps H.265 bitrate, and grounded the design in software vs hardware video decode.

edge-2401 (NEEDS_FIX, uniqueness WARN)

Hard-grounded in Hailo-8 DFC toolchain semantics — per-tensor symmetric INT8, no per-channel activation quantization, SiLU outliers (max 84 vs 99.9th percentile 12) — and computed exact INT8 scale and effective-bits per calibrator. No longer generic Min-Max-vs-Entropy.

edge-2402 (NEEDS_FIX, uniqueness WARN — too similar to edge-2370)

Pivoted from "power-mode bandwidth selection" to "thermal throttle floor for roadside SLA". Question now integrates BPMP DVFS clamp (32% sustained) with LPDDR5 efficiency degradation and recommends heatpipe vs model trim. Different lever from edge-2370.

edge-2406 (DROP, math_correct ERROR)

Math was utilization-divided instead of utilization-multiplied. Rewrote solution: PCIe-bound 317 FPS, compute-bound 364 cold / 255 throttled. The binding constraint flips from PCIe to compute at the 8-min thermal mark — a more interesting result than the original (which had no flip). Common_mistake now explicitly calls out the 26000/50/0.7 = 743 division error.

edge-2416 (DROP, uniqueness ERROR — duplicate of edge-2363)

Pivoted from M/D/1 vs M/M/1 to a Coral TPU static-batching analysis (B=1, 2, 4) with fill-time penalty for batches. Different question structure entirely; demonstrates that largest batch is not always best despite higher throughput ceiling.

edge-2424 (NEEDS_FIX, uniqueness WARN)

Grounded specifically in TensorRT 8.6 toolchain on Orin: which calibrator (IInt8MinMaxCalibrator vs IInt8EntropyCalibrator2 vs IInt8LegacyCalibrator), plus per-layer precision overrides, FP16 mixed-precision fallback, DLA offload via --useDLACore. No longer generic QAT vs PTQ.

mobile-1870 (DROP, math_correct ERROR — judge said K=11 yields 0.0077)

Recomputed M/M/1/K blocking probabilities. Min K = 11 (P_11 = 0.00770), not 12 as solution claimed. Updated solution with full sweep K=9..12 and corrected napkin math.

mobile-1881 (NEEDS_FIX, visual_alignment WARN)

Visual shows Cloud->Phone->NPU. Added the NPU mapped-ingest stage (6 GB/s) as the fourth pipeline stage, expanding the question to a 4-stage analysis. The NPU stage doesn't change the binding constraint (download still wins) but makes the visual align.

mobile-1891 (DROP, visual_alignment ERROR)

Visual showed 75 ms; correct answer 400 ms. Reframed question so the figure's 75 ms is explicitly the naive bandwidth-only baseline being challenged by serialization tax + UFS contention + parallelization analysis.

mobile-1896 (NEEDS_FIX, visual_alignment WARN)

Stacked bar chart was misleading for a harmonic-mean question. Reframed question to use the bar chart's arithmetic-mean reading (92 GB/s) as the junior-engineer wrong answer the student must correct (88.2 GB/s).

mobile-1897 (NEEDS_FIX, visual_alignment WARN)

Visual shows uplink-only star topology. Narrowed question scope to client->host uplink only, dropping the downlink limb that the visual didn't show.

mobile-1903 (DROP, visual_alignment ERROR)

Visual showed sequential Gantt; solution requires double-buffered overlap. Reframed to use the figure's sequential 30.3 FPS as the v1 baseline that double-buffering must improve upon (62.5 FPS = 2.13x uplift).

mobile-1918 (DROP, math_correct ERROR — energy units wrong)

Fixed energy units throughout: ms*W = mJ, never mWh for sub-second windows. Reworked the policy comparison: continuous (100 mJ/window, 33 ms latency) vs batched (15.85 mJ/window, 115 ms worst-case latency). Energy ratio ~6.3x.

mobile-1929 (DROP, cell_fit WARN, uniqueness WARN)

Pivoted from "duty-cycle vs always-on" (overlaps mobile-1918) to "wake/sleep break-even rate analysis": full charge-up energy model with 2 ms wake at 1 W

0.8 ms compute + 0.5 ms sleep flush, sweeping {1, 5, 10, 30} Hz. Break-even vs always-on at 303 Hz; recommends 10 Hz matching UX (42.7 mJ/s, 23x better than always-on).

mobile-1948 (DROP, scenario_realism ERROR)

Original scenario implied placing entire LLM KV in NPU SRAM (impossible). Rewrote to a tiered scheme: 96 MB KV in LPDDR5 vs sliding-window in 8 MB Hexagon TCM. Hit rate ~8% gives only ~8% energy savings; INT4 K/V quantisation is the bigger lever.

mobile-1949 (DROP, scenario_realism WARN, uniqueness WARN)

Replaced generic CPU vs NPU template with hard numbers: Cortex-X4 NEON SDOT (27 GOPS @ 2 W = 13.5 GOPS/W peak) vs Hexagon NPU (12.86 GOPS/W peak) and realistic 10 GOPS demand-point efficiency (10 GOPS/W vs 12.8 GOPS/W). Shows the NPU's lead is only 1.3x at this demand, widens dramatically as workload scales.

mobile-1982 (DROP, visual_alignment ERROR)

Visual claimed 300 ms drain; correct answer 2720 ms. Reframed so the figure's linear 300 ms drain is the naive baseline (15 frames * 20 ms) and the question asks for the realistic worst-case under cold-cache + variance + open-loop arrivals.

mobile-1995 (DROP, math_correct ERROR — ignored 800us preempt + 18% throttle)

Rewrote solution to honour both costs: FIFO tail = 50 ms cold / 61 ms throttled (both fail 16.6 ms); priority+preempt tail = 5.8 ms cold / 6.9 ms throttled (both pass). 800 us tax is 4.8% of budget.

mobile-2025 (NEEDS_FIX, cell_fit WARN — L6+ but trivial math)

Upgraded from "25M4 - 25M0.5 = 87.5 MB savings" to a full mixed-precision SLC-resident allocation: per-tier weight footprint (4M FP16 embedding + 3.2M INT8 outer attention + 17.8M INT4 bulk = 20.1 MB) plus 4 MB activation double-buffer plus 6 MB OS reservation, yielding 73.4 effective TOPS. Three constraints to balance simultaneously now match L6+ optimization complexity.

mobile-2028 (DROP, math_correct ERROR — never computed fallback budget)

Added the missing fallback computation: 0.7 * 25 + 0.3 * 25 * 1.4 = 28 MFLOPs/frame -> 3.36 GFLOPs/s (12% increase over INT8 nominal 3.0 GFLOPs/s). Throughput is trivially under-provisioned; thermal envelope at 1.4 W is the real binding constraint.

tinyml-1562 (NEEDS_FIX, visual_alignment WARN)

Visual shows 1000 MB/s raw line rate; correct effective is 886 MB/s. Reframed question so the figure's optimistic 1000 MB/s view is the input the student must derate by 128b/130b + TLP overhead (886 MB/s effective).

tinyml-1634 (DROP, math_correct WARN, scenario_realism WARN)

Replaced unrealistic 100 KB / 5 ms (20 MB/s) flash speed with realistic 8 KB / 80 ms (100 KB/s) STM32L4 numbers. Built full energy-budget comparison: 5-min cadence wins 273 mJ/hr vs 691 mJ/hr (2.5x), with analytic optimum at 2.31 s.

tinyml-1652 (DROP, scenario_realism WARN, uniqueness WARN)

Pivoted from "Ring AllReduce on UART-daisy-chained Cortex-M4" to a realistic "BLE mesh federated embedding-reduce on nRF52840 nodes" with 3 uJ/byte radio TX energy. Same 800 B per-node math, now grounded in a real TinyML federated- sensor scenario.

tinyml-1661 (NEEDS_FIX, math_correct WARN)

Added the brownout-voltage threshold (1.8 V BOR) to the energy formula and showed both the charge-budget approach (3.2 mC required vs 2.4 mC available) and the energy approach (5.76 mJ usable vs 7.68 mJ required). Both views agree the original 2 mF cap fails by ~25%; recommended C_min = 2.67 mF.

tinyml-1681 (NEEDS_FIX, cell_fit WARN — L6+ but 1.2x heuristic)

Replaced the 1.2x MAC heuristic with a true cycle-accurate cost model: T_total = T_load + T_MAC + T_spill + T_pad + T_store, decomposed for Cortex-M4 SMLAD with explicit register-pressure analysis (16 in-channels in 8 packed pairs forces 2-pair spill). Final answer: 5912 cycles vs heuristic 5760 (2.6% under), with explicit regimes where the heuristic loses fidelity.

tinyml-1716 (DROP, math_correct ERROR — ignored 8% DMA contention)

Reworked solution to honour the 8% DMA-contention tax: ideal 100 fps drops to 94 fps (1.88x speedup vs ideal 2.0x). Identifies that single-bank chip already gets 94% of the dual-bank benefit, justifying or rejecting the dual-bank cost premium based on tail-sensitivity.

tinyml-1721 (DROP, math_correct ERROR — ignored leakage and lifetime)

Added the constant 3 uA leakage (correctly small at 0.3% of budget) and the non-linear discharge derate from 250 mAh nameplate to 150 mAh effective: naive lifetime 247 hr vs realistic 148 hr (40% shorter). Active compute dominates at 98.8% of total current.

tinyml-1723 (DROP, math_correct ERROR — single-buffer sum instead of overlap)

Rewrote to a true triple-buffer (three-frames-in-flight) accounting: 3 * (2 + 12 + 0.5) KB = 43.5 KB (~17% of 256 KB SRAM), vs ping-pong's 29 KB (~11%). Sensitivity analysis shows that a 64 KB activation arena pushes triple-buffering to 78% of SRAM (infeasible), forcing ping-pong fallback.

tinyml-1724 (DROP, math_correct ERROR — ignored which retention mode wins)

Full charge-budget comparison of RAM retention vs cold-boot under both 50 ms and 500 ms cadences. RAM retention wins both regimes (2.81 vs 3.05 mA unbatched; 2.09 vs 2.11 mA batched). Computed break-even sleep period = 12 uC / 5 uA = 2.4 s, above which cold-boot becomes the right choice.

tinyml-1732 (NEEDS_FIX, cell_fit WARN — didn't address half-transfer interrupt mechanics)

Replaced high-level "DMA frees the CPU" answer with full HTIF/TCIF circular- DMA mechanics: pointer-swap on each interrupt, deadline = N / sample_rate (32 ms at 16 kHz, N=512), and explicit silent-overrun failure mode if compute_time > per-half deadline.

Concerns / caveats for re-judge

Some YAMLs needed plain-scalar -> single-quoted scalar conversions to avoid mapping values not allowed here errors when : patterns appear in technical prose (edge-2416, mobile-1897, tinyml-1723). Quoted scalars verified to parse cleanly.
mobile-1897 had an editing artifact (duplicate details: key) which was caught and corrected mid-pass. Final structure verified.
mobile-2025 is L6+ optimization; the rewrite expands the cognitive load substantially (multi-constraint mixed-precision allocation with throughput maximisation), which may push expected_time_minutes higher than the current 5; left unchanged per scope rules but flagged for the reviewer.
edge-2367 (L1 recall) — reframing the visual as a "textbook baseline" is consistent with the visual's content but requires the reader to do a gap-analysis (66.6 FPS upper bound vs 36.5 FPS realised), which is slightly above pure recall. The bloom_level is remember and the zone is recall per the locked schema axes; the reviewer may want to verify this still fits L1.
tinyml-1652 kept the 3-node ring topology to align with the visual, but moved from "UART-daisy-chained Cortex-M4" to "BLE mesh on nRF52840" to land a realistic federated-sensor scenario. The visual alt ("three-node cyclic ring topology") still aligns.

13 KiB Raw Permalink Blame History