20 of 20 workloads now schema-valid; 9 of 11 measurable workloads have
evidence-bound regime values backed by sidecars in roofline/. The
linter passes --verify-against-sidecars across the suite. 13 prior
guess-classifications were corrected by measurement; the surprises
(DLRM compute-bound, ResNet bandwidth-bound, Diffusion bandwidth-bound)
will inform the paper's prose. Branch parked.
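For readers outside the repo: a sidecar here is a small per-workload
JSON record under roofline/ holding the measured numbers a regime
claim is bound to. A minimal hypothetical shape (field names are
illustrative, not the actual schema):

    # Hypothetical sidecar contents, shown as a Python dict so fields
    # can be annotated; the real schema in roofline/ may differ.
    sidecar = {
        "workload": "nanogpt-prefill",
        "intensity_flop_per_byte": 289.0,  # measured arithmetic intensity
        "peak_flops": 3.0e12,              # measured peak (illustrative value)
        "peak_bw_bytes_per_s": 2.0e11,     # measured bandwidth (illustrative)
        "regime": "compute-bound",         # derived from the above, never hand-written
    }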
Folds in:
- bench/measure_peaks.py: real per-machine peak FLOPS + BW
  measurement (sketch below).
- roofline.py: reads from the cache.
- manifest.py: rejects dirty trees on closed division.
- check_taxonomy.py: new --verify-against-sidecars flag.
- nanogpt_prefill: emits sidecars.
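A minimal sketch of what the per-machine peak measurement does,
assuming PyTorch on MPS (sizes, iteration counts, and names are
illustrative; bench/measure_peaks.py is the real thing):

    import time
    import torch

    def measure_peak_flops(n=4096, iters=20, device="mps"):
        # Time back-to-back n x n matmuls; each one costs 2*n^3 FLOPs.
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        for _ in range(3):
            a @ b                          # warmup
        torch.mps.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.mps.synchronize()
        return 2 * n**3 * iters / (time.perf_counter() - t0)

    def measure_peak_bw(nbytes=1 << 30, iters=20, device="mps"):
        # Clone a 1 GiB buffer; each clone reads and writes nbytes.
        x = torch.empty(nbytes // 4, dtype=torch.float32, device=device)
        torch.mps.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            x.clone()
        torch.mps.synchronize()
        return 2 * nbytes * iters / (time.perf_counter() - t0)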
Empirical findings: hardcoded M1 peaks were 5.5-7.7x off for this
machine (M-series Pro/Max). The --verify-against-sidecars flag caught
a YAML claim that didn't survive real measurement: the nanogpt-prefill
dispatch claim had been calibrated against the wrong peaks. A sketch
of the check follows.
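Conceptually the check reduces to comparing each YAML regime claim
against the ridge point implied by the measured peaks, which is
exactly why wrong peaks flip classifications. A sketch, reusing the
hypothetical sidecar fields above (the real logic lives in
check_taxonomy.py and is presumably more careful):

    def verify_regime(claimed, sidecar):
        # Ridge point: the intensity at which this machine crosses
        # from bandwidth-bound to compute-bound.
        ridge = sidecar["peak_flops"] / sidecar["peak_bw_bytes_per_s"]
        measured = ("compute-bound"
                    if sidecar["intensity_flop_per_byte"] >= ridge
                    else "bandwidth-bound")
        if claimed != measured:
            raise ValueError(f"YAML says {claimed}; measurement says {measured}")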
Branch parked. 6 of 10 iterations complete (counting iteration 5.5).
Snapshots iter-3 from the standalone repo. Adds:
- Real KV-cache plumbing in gpt2_infer.py (CausalSelfAttention,
  GPTBlock, and GPT2WhiteBox now support use_kv_cache +
  past_key_values); see the sketch after this list.
- NanoGPTWhiteBox unified forward signature returning either
  (logits, loss) for training or (logits, present_kvs) for inference;
  max_seq_len bumped 1024 -> 2048 per Dean's sizing math.
- Two new workloads (nanogpt-prefill, nanogpt-decode) sharing the
  same trained checkpoint. Prefill demonstrates compute-bound
  behavior (~289 FLOP/byte at ctx=1792); decode demonstrates the
  bandwidth-bound regime (~0.5 FLOP/byte) that dominates LLM serving.
  A back-of-envelope for both numbers follows this list.
- smoke_nanogpt_phases.py harness with intensity-ratio gate >= 5x;
measured 578x on M-series MPS.
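As referenced in the first item, a minimal sketch of the KV-cache
pattern (shapes and names are illustrative, not gpt2_infer.py's exact
code; the causal mask needed during prefill is omitted for brevity):

    import torch

    def attention_step(q, k_new, v_new, past_kv=None):
        # q, k_new, v_new: (batch, heads, t_new, head_dim). Prefill
        # passes the whole prompt with past_kv=None; decode passes one
        # token plus the cache returned by the previous step.
        if past_kv is not None:
            k = torch.cat([past_kv[0], k_new], dim=2)  # grow along time
            v = torch.cat([past_kv[1], v_new], dim=2)
        else:
            k, v = k_new, v_new
        att = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        out = att.softmax(dim=-1) @ v
        return out, (k, v)                 # (output, present_kv)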
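And the promised back-of-envelope for the two intensity numbers,
under the simplifying assumption that fp32 weight traffic dominates
the byte count:

    # One linear layer with a (d_in x d_out) weight, processing T tokens:
    #   FLOPs ~= 2 * T * d_in * d_out
    #   bytes ~= 4 * d_in * d_out          (weights, fp32)
    #   intensity ~= T / 2  FLOP/byte
    # decode  (T=1):    ~0.5 FLOP/byte -> bandwidth-bound, as measured
    # prefill (T=1792): <= 896 FLOP/byte in this idealized bound;
    #   attention and activation traffic pull the measured value to ~289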
Working group sign-off: Dean (proposer + verifier).
Branch parked; not for merge to dev. Three iterations complete; seven
remaining per the autonomous loop plan.
Snapshots the autonomous-iteration work happening in the standalone
/Users/VJ/GitHub/mlperf-edu/ repo. Two iterations folded in:
iter-1: code-defect cleanup (Patterson + Dean sign-off)
- Remove dead simulated_loss + load_real_wikitext_data from
  nanogpt_train.py; align NanoGPTWhiteBox vocab to char-level
  (50,257 -> 128, dropping ~19.3M unused embedding params; arithmetic
  below).
- Fix two broken examples.{edge,mobile} imports in inference paths.
- Reconcile README benchmark table with workloads.yaml (was wrong
on 7 of 16 workloads).
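Sanity check on the ~19.3M figure (the embedding width is inferred
from the arithmetic, not stated in this log):

    dropped_rows = 50_257 - 128          # 50,129 unused char-level rows
    # 19.3e6 / 50_129 ~= 385, consistent (to rounding) with the
    # d_model = 384 that standard char-level nanoGPT configs use:
    params_dropped = dropped_rows * 384  # 19,249,536 ~= 19.25M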
iter-2: DLRM DRAM-resident variant (Emer sign-off)
- New MicroDLRMDRAM with a 2M-row hash-mapped virtual EmbeddingBag,
  sized so the per-batch transfer (8 MB at B=8192, m_spa=256) takes
  long enough to clear PyTorch's ~50 us dispatch floor and exhibit
  the bandwidth-bound regime production DLRM lives in (sketch after
  this list).
- Smoke test asserts a pure-lookup gap >= 3x; the current host shows
  4.29x end-to-end and 3.49x lookup-only.
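As referenced above, the sizing arithmetic plus the hash-mapped
lookup idea in sketch form (MicroDLRMDRAM's real implementation may
differ):

    import torch
    import torch.nn as nn

    B, m_spa, rows = 8192, 256, 2_000_000
    bytes_per_batch = B * m_spa * 4      # fp32: 8,388,608 bytes = 8 MiB
    # The table itself is 2M x 256 x 4 B ~= 2 GB: far past any cache
    # level, which is the point of the DRAM-resident variant.

    class HashedEmbeddingBag(nn.Module):
        # A "virtual" table: arbitrary ids hash (here: modulo) into a
        # fixed 2M-row bag, so every batch strides across DRAM.
        def __init__(self, rows, dim):
            super().__init__()
            self.rows = rows
            self.bag = nn.EmbeddingBag(rows, dim, mode="sum")

        def forward(self, ids, offsets):
            return self.bag(ids % self.rows, offsets)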
Branch is parked; not for merge to dev. Iteration log lives in the
standalone repo under .iteration_log/ (gitignored locally).