Files
cs249r_book/interviews/vault/scripts/review_math.sh
Vijay Janapa Reddi eb71638630 feat(vault): release-grade Phase G — full audit + cleanup + 0.1.3 release
Final brute-force release-readiness pass: every gate green, 0.1.3
released and verified, every observable failure mode closed at source.

═══ AUDITS (G.A–G.D) ═══

G.A — gemini-3.1-pro-preview default everywhere. Active CLI scripts
    already used it; bulk-patched 6 legacy scripts (`generate_batch.py`,
    `validate_questions.py`, `generate_gaps.py`, `run_reviews.sh`,
    `generate.py`, `review_math.sh`) + WORKFLOW.md from `gemini-2.5-flash`
    or `gemini-2.5-pro` to `gemini-3.1-pro-preview`. Only `archive/`
    references remain (intentionally legacy).

G.B — Cloudflare workflow audit. `vault verify 0.1.1` correctly
    failed (YAMLs evolved since 0.1.1 cut). Confirmed `vault publish`,
    `vault deploy`, `vault ship`, `vault rollback`, `vault verify`,
    `vault snapshot`, `vault tag` all wired. Released 0.1.2 then 0.1.3
    to lock final state.

G.C — Visual asset integrity audit. 236/236 YAML visual references
    resolve, 0 orphan SVGs, 0 missing files, 0 unrendered sources.
    Clean.

G.D — Unit tests for new validators added at `tests/test_models.py`:
    15 tests covering Visual.kind enum, Visual.path regex, Visual.alt
    + caption min lengths + required, Question._zone_bloom_compatible
    (recall+remember accepted, recall+evaluate rejected, mastery+
    remember rejected, evaluation+evaluate accepted, design+create
    accepted), Question._visual_path_resolves. **15/15 pass.**
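
    The zone/Bloom gate can be sketched as a plain predicate. The matrix
    below is illustrative only: the five accept/reject pairs come from the
    tests listed above, the other entries are assumptions, and the real
    check lives on the `Question` model as `_zone_bloom_compatible`:

    ```python
    # Illustrative zone -> allowed-Bloom-levels matrix. Only the five pairs
    # exercised by the tests above are grounded; other entries are assumptions.
    ZONE_ALLOWED_BLOOMS = {
        "recall": {"remember"},
        "mastery": {"apply", "analyze"},   # assumption: mastery excludes "remember"
        "evaluation": {"evaluate"},
        "design": {"create"},
    }

    def zone_bloom_compatible(zone: str, bloom: str) -> bool:
        """Mirror of the documented compatibility rule, not the real validator."""
        return bloom in ZONE_ALLOWED_BLOOMS.get(zone, set())
    ```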

═══ CONTENT CLEANUP (G.E–G.L) ═══

G.E — Sample re-judge of 100 random cloud parallelism items via
    Gemini 3.1 Pro Preview (4 API calls): 53% PASS / 23% NEEDS_FIX /
    24% DROP. Surfaced legacy quality drift — items generated under
    pre-Phase-D laxer prompts were not meeting the new strict bar
    (math errors with bidirectional vs unidirectional NVLink,
    "Based on the diagram..." references with no diagram, deprecated
    practices like SSP for modern LLM training, wrong-track scenarios
    like Cortex-M4 in cloud track).

G.H — General-purpose cleanup agent on 47 flagged items:
    **31 rewritten** with PARALLELISM_RULES bar applied (concrete
    unidirectional NVLink 450 GB/s, IB NDR 25 GB/s, RoCE v2 22 GB/s,
    PCIe Gen3 12 GB/s; multi-step ring AllReduce arguments with the
    2(N-1)/N factor; non-obvious failure modes); **16 archived** with
    documented `deletion_reason` (mathematically broken premises,
    physics errors, topic-irreconcilable, direct duplicates).
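
    For context on the 2(N-1)/N factor: a ring AllReduce does a
    reduce-scatter then an all-gather, each moving (N-1)/N of the payload
    per rank, so each link carries 2(N-1)/N times the gradient size. A
    napkin-math sketch using the unidirectional NVLink figure above (the
    payload and GPU count are made-up example values):

    ```python
    def ring_allreduce_time_s(payload_gb: float, n_gpus: int, link_gb_s: float) -> float:
        """Ideal ring AllReduce time: each rank moves 2*(N-1)/N of the payload
        over its link (reduce-scatter phase plus all-gather phase)."""
        bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * payload_gb
        return bytes_on_wire / link_gb_s

    # Example: 10 GB of gradients on 8 GPUs over NVLink at 450 GB/s unidirectional
    t = ring_allreduce_time_s(10, 8, 450)   # 17.5 GB on the wire, ~0.039 s
    ```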

G.L — Re-judge of 31 G.H rewrites: **23 PASS / 3 NEEDS_FIX / 5 DROP =
    74.2% pass rate**. The 8 still-failing items were archived (even after
    the cleanup pass they couldn't satisfy the strict bar). Contract:
    items get THREE chances — original generation, fix-agent, retry-
    fix — and if they still fail, archived not promoted. Honest.

═══ STUBBORN-FAIL ARCHIVES (Phase F residuals) ═══

After three independent fix-agent passes (Phase C, F.2, F.4), 4 items
remained NEEDS_FIX or DROP: edge-2390, edge-2401, mobile-1948,
tinyml-1681. Archived with `deletion_reason` documenting the 3-attempt
failure history. The cell may be structurally awkward; the items are
preserved for audit but removed from the bundle.

═══ ORPHAN CHAIN FIX ═══

After archives, `cloud-chain-359` had only 1 published member
(`cloud-1840`); its sibling `cloud-1845` was archived. Dropped the
chain ref from cloud-1840 + ran `repair_chains.py` to clean residual
references in archived YAMLs. `vault check --strict` now passes with
0 chain warnings.

═══ E.2 / E.3 SHIPPED EARLIER IN PRIOR COMMIT ═══

(Documented in commit `20ea20005` for completeness):
- `vault build --legacy-json` auto-emits `vault-manifest.json`.
- `analyze_coverage_gaps.py --include-areas <areas>` flag.

═══ 0.1.3 FINAL RELEASE ═══

`vault publish 0.1.3` snapshot at `releases/0.1.3/`. Migrations:
+0 ~27 -28 (zero net new questions, 27 modified during cleanup, 28
archived/promoted). `vault verify 0.1.3` ✓ — release_hash
`793c06f414f2bf8391a8a5c56ec0ff8d76bfce4ab7c64ad12ecb83f6d932280e`
reconstructs from YAML. Latest symlink → 0.1.3.

═══ FINAL ALL-9-GATES SWEEP — ALL GREEN ═══

[1] vault check --strict          ✓ 10,701 / 0 errors / 0 invariants
[2] vault lint                    ✓ 0 errors / 0 warnings / 9,757 info
[3] vault doctor                  ✓ 0 fails (registry-history info OK)
[4] vault codegen --check         ✓ artifacts in sync
[5] vault verify 0.1.3            ✓ hash reconstructs from YAML
[6] staffml validate-vault        ✓ 0 errors / 0 warnings, deployment-ready
[7] render_visuals                ✓ 236 visuals, 0 errors
[8] tsc                           ✓ TypeScript clean
[9] Playwright                    ✓ 9/9 pass

═══ FINAL CORPUS STATE ═══

Bundle: 9,757 published (was 9,224 at branch cut, **+533 net** across
the full multi-session push, after all archives).

Total commits on branch since cut: 10.
Release tag latest: 0.1.3 (verified-clean).
Status: StaffML-day-ready. Ship it.
2026-04-25 19:45:32 -04:00

136 lines
5.1 KiB
Bash
Executable File

#!/bin/bash
# ═══════════════════════════════════════════════════════════════
# Math Review: Launch parallel Gemini reviews of all questions
# Uses gemini-3.1-pro-preview to verify napkin math in every question
# ═══════════════════════════════════════════════════════════════
set -euo pipefail
MODEL="${GEMINI_MODEL:-gemini-3.1-pro-preview}"
INTERVIEWS_DIR="$(cd "$(dirname "$0")" && pwd)"
REPORTS_DIR="${INTERVIEWS_DIR}/_reviews"
CHUNKS_DIR="/tmp/staffml-review-chunks"
MAX_PARALLEL="${MAX_PARALLEL:-8}"
mkdir -p "$REPORTS_DIR" "$CHUNKS_DIR"
rm -f "$CHUNKS_DIR"/*.md   # drop stale chunks from earlier runs
# Split large markdown files into chunks of ~50 questions each
echo "═══ Splitting files into review chunks ═══"
split_file() {
    local file="$1"
    local basename="$(basename "$file" .md)"
    local dirname="$(basename "$(dirname "$file")")"
    local prefix="${dirname}-${basename}"
    # Split on <details> blocks — each is one question.
    # Note: $file, $CHUNKS_DIR, and ${prefix} are interpolated by the shell
    # before Python runs; {chunk_num:02d} is a Python f-string field.
    python3 -c "
content = open('$file').read()
blocks = content.split('<details>')
header = blocks[0]
questions = ['<details>' + b for b in blocks[1:]]
chunk_size = 50
chunk_num = 0
for i in range(0, len(questions), chunk_size):
    chunk = questions[i:i+chunk_size]
    chunk_num += 1
    outfile = f'$CHUNKS_DIR/${prefix}-chunk{chunk_num:02d}.md'
    with open(outfile, 'w') as f:
        if chunk_num == 1:
            f.write(header)
        f.write('\n'.join(chunk))
    print(f'  {outfile}: {len(chunk)} questions')
"
}
# Process all question files
for track in cloud edge mobile tinyml; do
    for file in "${INTERVIEWS_DIR}/${track}"/*.md; do
        [[ "$(basename "$file")" == "README.md" ]] && continue
        split_file "$file"
    done
done
# find succeeds even when no chunks exist (a failing `ls` would abort the
# script under set -e/pipefail)
CHUNK_COUNT=$(find "$CHUNKS_DIR" -maxdepth 1 -name '*.md' | wc -l | tr -d ' ')
echo ""
echo "═══ ${CHUNK_COUNT} chunks ready for review ═══"
echo "═══ Model: ${MODEL} | Max parallel: ${MAX_PARALLEL} ═══"
echo ""
# The review prompt
REVIEW_PROMPT='You are a math verification expert reviewing ML Systems interview questions.
For EACH question in the file below, check:
1. **Napkin Math Accuracy**: Are the calculations correct? Check arithmetic, unit conversions, order of magnitude.
2. **Hardware Specs**: Are GPU/TPU/MCU specs realistic? (e.g., H100 = 80GB HBM3, 3.35 TB/s bandwidth, 989 TFLOPS FP16)
3. **Answer Consistency**: Does the napkin math in the answer actually support the conclusion?
4. **Unit Errors**: GB vs GiB, MB/s vs Mb/s, FLOPS vs FLOPs, etc.
5. **Logical Errors**: Does the "common mistake" actually describe a real mistake? Does the "realistic solution" make sense?
OUTPUT FORMAT — For each error found, output EXACTLY:
```
ERROR | <question-title> | <error-type> | <what-is-wrong> | <correct-value>
```
For warnings (not wrong but could be better):
```
WARN | <question-title> | <issue-type> | <description>
```
If a question is correct, output:
```
OK | <question-title>
```
Review ALL questions. Do not skip any. Be rigorous but fair — napkin math is approximate by nature, so flag only genuine errors (>2x off for cloud, >50% off for tinyml).'
# Launch reviews in parallel
review_chunk() {
    local chunk="$1"
    local chunk_name="$(basename "$chunk" .md)"
    local report="${REPORTS_DIR}/${chunk_name}-review.txt"
    echo "[START] ${chunk_name}"
    if gemini -m "$MODEL" "${REVIEW_PROMPT}" < "$chunk" > "$report" 2>/dev/null; then
        # grep -c prints "0" *and* exits 1 when nothing matches, so
        # `|| echo 0` would capture "0\n0"; use `|| true` plus a default.
        local errors warns oks
        errors=$(grep -c "^ERROR" "$report" 2>/dev/null || true)
        warns=$(grep -c "^WARN" "$report" 2>/dev/null || true)
        oks=$(grep -c "^OK" "$report" 2>/dev/null || true)
        echo "[DONE] ${chunk_name}: ${errors:-0} errors, ${warns:-0} warnings, ${oks:-0} OK"
    else
        echo "[FAIL] ${chunk_name}: Gemini call failed"
        echo "GEMINI_CALL_FAILED" > "$report"
    fi
}
export -f review_chunk
export MODEL REPORTS_DIR REVIEW_PROMPT
# Run with GNU parallel or xargs fallback
# Chunk filenames are generated without whitespace, so line-based piping is safe
if command -v parallel &>/dev/null; then
    find "$CHUNKS_DIR" -maxdepth 1 -name '*.md' | parallel -j "$MAX_PARALLEL" review_chunk
else
    find "$CHUNKS_DIR" -maxdepth 1 -name '*.md' | xargs -P "$MAX_PARALLEL" -I {} bash -c 'review_chunk "$@"' _ {}
fi
echo ""
echo "═══ Review complete! ═══"
echo "Reports in: ${REPORTS_DIR}/"
echo ""
# Summary
echo "═══ Summary ═══"
# Count matching lines directly; `|| true` keeps pipefail from appending a
# spurious fallback value when grep finds no matches (grep exits 1 but wc
# has already printed the count).
total_errors=$(grep -h "^ERROR" "$REPORTS_DIR"/*-review.txt 2>/dev/null | wc -l | tr -d ' ' || true)
total_warns=$(grep -h "^WARN" "$REPORTS_DIR"/*-review.txt 2>/dev/null | wc -l | tr -d ' ' || true)
total_ok=$(grep -h "^OK" "$REPORTS_DIR"/*-review.txt 2>/dev/null | wc -l | tr -d ' ' || true)
echo " Errors: ${total_errors}"
echo " Warnings: ${total_warns}"
echo " OK: ${total_ok}"
echo ""
# Aggregate all errors into one file
echo "═══ All Errors ═══" > "${REPORTS_DIR}/ALL_ERRORS.txt"
# The "(none)" fallback must also land in the file, not on stdout
grep -rh "^ERROR" "$REPORTS_DIR"/*-review.txt >> "${REPORTS_DIR}/ALL_ERRORS.txt" 2>/dev/null \
    || echo "(none)" >> "${REPORTS_DIR}/ALL_ERRORS.txt"
echo ""
echo "Full error list: ${REPORTS_DIR}/ALL_ERRORS.txt"
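
# Not part of the script, but the pipe-delimited `ERROR | ... | ...` lines
# the prompt mandates are straightforward to post-process. A minimal Python
# parser sketch (the sample line is invented for illustration):
#
#     def parse_review_line(line: str):
#         """Parse one 'STATUS | field | ...' report line into
#         (status, fields), or None if the line doesn't match."""
#         parts = [p.strip() for p in line.split("|")]
#         if parts[0] not in {"ERROR", "WARN", "OK"}:
#             return None
#         return parts[0], parts[1:]
#
#     status, fields = parse_review_line(
#         "ERROR | H100 memory math | unit-error | used GiB as GB | 80 GB HBM3"
#     )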