mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-07-15 21:28:33 -05:00

Rocky c46ec4734d fix(mlperf-edu): anomaly-ae-train model-size gate impossible to pass (#1937)

max_model_size_kb: 32 could never be satisfied by the reference model it
gates: AnomalyDetectionAE (640/784 -> 128x4 -> 8 -> 128x4 -> 640) has
~266K parameters -- matching this same file's own `params: 0.3M` -- which
is ~1.03MB at FP32 and ~260KB even fully INT8-quantized. A submitter who
trains exactly this reference model and reports the true model size would
automatically fail the size gate regardless of reconstruction quality.

Corrected the budget to 300KB, consistent with the file's own stated 0.3M
parameter count at 1 byte/param (INT8), the standard deployment target for
this suite's "microcontroller" framing.

2026-07-15 11:47:33 +02:00
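
The size arithmetic in the commit message above can be checked with a short sketch. It assumes the autoencoder is a stack of plain fully connected layers with biases and nothing else (no normalization layers), uses the 640-wide input, and takes 1 KB as 1024 bytes. Under those assumptions the count lands on the quoted ~266K parameters. `WIDTHS`, `dense_param_count`, and `size_kib` are illustrative names, not part of the repository.

```python
# Layer widths read off the commit message: 640 -> 128x4 -> 8 -> 128x4 -> 640.
# Assumption: plain fully connected layers with biases and no other parameters.
WIDTHS = [640, 128, 128, 128, 128, 8, 128, 128, 128, 128, 640]


def dense_param_count(widths):
    """Sum weights (fan_in * fan_out) and biases (fan_out) over consecutive layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(widths, widths[1:]))


def size_kib(params, bytes_per_param):
    """Model size in KiB for a given storage width per parameter."""
    return params * bytes_per_param / 1024


params = dense_param_count(WIDTHS)
fp32_kib = size_kib(params, 4)  # 4 bytes per FP32 parameter
int8_kib = size_kib(params, 1)  # 1 byte per INT8 parameter

print(f"parameters: {params}")
print(f"FP32: {fp32_kib:.1f} KiB, INT8: {int8_kib:.1f} KiB")
print("fits the old 32 KB gate:", int8_kib <= 32)
print("fits the new 300 KB gate:", int8_kib <= 300)
```

Even at one byte per parameter the model is roughly eight times larger than the old 32 KB budget, which is why the gate could not be passed.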

README.md

Warning

🚧 Under construction

This tree is a runnable preview, not a stable public benchmark release. The mlperf CLI, registry, reports, and validation paths work locally, but public-result policy, dataset approval, and MLCommons endorsement review are still in progress. Do not rely on it for production benchmarking until we publish a stable "1.0" teaching release.

Note

📌 Early work (2026)

MLPerf EDU is being developed in public alongside the 2026 MLSysBook ecosystem. Harness scripts, compliance checks, and teaching notes will keep moving as we align workloads with the core curriculum.

Feedback: open a GitHub issue or pull request, especially if something in this README is wrong or outdated.

MLPerf EDU 🎓

A 30-workload pedagogical ML systems benchmark registry with runnable min/max coverage and a pro research envelope, aligned with MLCommons MLPerf.

MLPerf EDU brings industry-standard ML benchmarking into teaching and research. The core teaching models are self-contained, white-box PyTorch nn.Module implementations, while the SLM suite uses off-the-shelf Hugging Face models for local serving, quantization, LoRA, and backend studies.

📄 Paper: See paper/paper.tex — "MLPerf EDU: Bridging Industry Benchmarking and ML Systems Education"

🧭 North star: See NORTH_STAR.md for the two-year goal: MLPerf EDU as the SPEC-like, runnable academic benchmark substrate for ML systems papers.

📦 Install: See INSTALL.md for the uv sync, uv tool install, and uv build package workflow.

📋 Product contract: See SPEC.md for the CLI, suite/profile vocabulary, backend policy, and validation presets that keep this tree runnable from a fresh clone.

🗂️ Workload registry: See registry/ for the native suite/workload/variant metadata layout. workloads.yaml is kept as a generated compatibility mirror.

🔁 Iteration loop: See ITERATION_LOOP.md for how we collect student, instructor, researcher, MLCommons, and maintainer feedback without confusing the user-facing product.

⚖️ Public rules: See PUBLIC_RULES.md for score-bearing, performance-bearing, systems-only, and scenario promotion rules.

🧾 Dataset release review: See DATASET_RELEASE_REVIEW.md for the public dataset decisions that remain before endorsement.

🎯 Quality target review: See QUALITY_TARGET_REVIEW.md for the expert-review matrix behind score-bearing and performance-bearing rows.

🚢 Release checklist: See RELEASE_CHECKLIST.md for the packaging and endorsement release bars.

📝 MLCommons proposal: See PROPOSAL.md for the endorsement path and staged review plan.

Quick Start: Run A Benchmark

# Clone and install
git clone https://github.com/harvard-edge/cs249r_book.git
cd cs249r_book/mlperf-edu
uv sync --extra dev

# Check that this machine can run MLPerf EDU
uv run mlperf doctor

# See available workloads
uv run mlperf list
uv run mlperf list matrix --profile max
uv run mlperf info --dataset tinyshakespeare
uv run mlperf info --model smollm2-135m

# Run the smallest local confidence path
uv run mlperf init --profile min

# Run the max benchmark profile
uv run mlperf fetch --profile max
uv run mlperf run --profile max --open-report

Every run writes JSON, HTML, and CSV reports. --open-report opens the HTML report in your browser. Add --power when you want aggregate estimated watts and joules without privileged hardware counters. Reports include hardware/software fingerprints, dataset/model asset dossiers, checkpoint dependencies, quality-required status, and provenance links. Use mlperf report <run-directory> --format html --open to open the latest report from a run directory without copying the timestamped JSON filename.
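
The relationship between the two power figures is simple: joules are average watts multiplied by elapsed seconds. The sketch below shows only that relationship. How `--power` actually obtains its estimates is not described in this README, so `estimate_energy` and its sample-list input are hypothetical.

```python
def estimate_energy(power_samples_w, duration_s):
    """Return (average watts, joules) from evenly spaced power samples.

    Assumption: samples are evenly spaced over the run, so the plain mean
    is a reasonable estimate of average power.
    """
    if not power_samples_w:
        raise ValueError("need at least one power sample")
    avg_watts = sum(power_samples_w) / len(power_samples_w)
    return avg_watts, avg_watts * duration_s


# Made-up samples for illustration only.
avg_watts, joules = estimate_energy([10.0, 12.0, 14.0], duration_s=5.0)
print(f"{avg_watts:.1f} W average, {joules:.1f} J total")
```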

Selection rule: a bare --profile min|max|pro selects that default profile path. --suite selects a workload domain. --workload <canonical> selects all variants under that workload family, and --variant <name> narrows to one.
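
The suite, workload, and variant part of the selection rule behaves like a chain of narrowing filters. The following is a minimal sketch of that idea, not the CLI's implementation. The `Entry` record and `select` helper are hypothetical, the workload and variant names are taken from this README, and placing nanogpt-inference in the language suite is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    suite: str
    workload: str
    variant: str


# Hypothetical registry rows for illustration.
REGISTRY = [
    Entry("slm", "smollm2-chat-inference", "baseline"),
    Entry("slm", "smollm2-chat-inference", "quantized-int8"),
    Entry("language", "nanogpt-inference", "prefill"),  # suite is an assumption
    Entry("language", "nanogpt-inference", "decode"),   # suite is an assumption
]


def select(entries, suite=None, workload=None, variant=None):
    """Keep entries that match every selector that was given (None means any)."""
    return [
        e for e in entries
        if (suite is None or e.suite == suite)
        and (workload is None or e.workload == workload)
        and (variant is None or e.variant == variant)
    ]


# --workload alone selects every variant in the family.
print(select(REGISTRY, workload="nanogpt-inference"))
# Adding --variant narrows to one.
print(select(REGISTRY, workload="nanogpt-inference", variant="decode"))
```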

Common Runs

# Minimal smoke profile
mlperf run --profile min --dry-run
mlperf run --profile min

# Max-profile score-bearing training benchmark
mlperf fetch --workload nanogpt-train --profile max
mlperf run --workload nanogpt-train --profile max --open-report

# Verify, inspect, and package a benchmark result
mlperf verify submissions/nanogpt-train_max.provd.json
mlperf report submissions/nanogpt-train_max_report.json --format html --open
mlperf report submissions/nanogpt-train_max_report.json --format csv
mlperf report submissions --format html --open
mlperf package submissions/nanogpt-train_max.provd.json

# Checkpoint-backed max-profile inference on the trained NanoGPT checkpoint
mlperf run --workload nanogpt-inference --variant prefill --profile max
mlperf run --workload nanogpt-inference --variant decode --profile max

# Max-profile recommender benchmark
mlperf fetch --workload micro-dlrm-train --profile max
mlperf run --workload micro-dlrm-train --profile max

# Max-profile tiny anomaly benchmark
mlperf fetch --workload anomaly-ae-train --profile max
mlperf run --workload anomaly-ae-train --profile max

# Max-profile vision benchmark
mlperf fetch --workload resnet18-train --profile max
mlperf run --workload resnet18-train --profile max

# Off-the-shelf SLM decode suite
mlperf fetch --suite slm --profile max --dry-run
mlperf run --workload smollm2-chat-inference --variant baseline --profile max --model smollm2-135m
mlperf run --workload smollm2-chat-inference --variant quantized-int8 --profile max --model smollm2-135m
mlperf run --workload smollm2-chat-inference --variant batched-b4 --profile pro --model smollm2-135m
mlperf run --workload smollm2-chat-inference --variant long-context --profile pro --model smollm2-135m

# Research-envelope profile
MLPERF_EDU_PRO_REPETITIONS=1 mlperf run --workload nanogpt-train --profile pro

Instructor And Maintainer Commands

These commands validate the suite itself, check public-result metadata, or grade submissions. Most students and paper readers should not need them for a normal benchmark run.

# Audit workload labels and public-result contracts; does not run benchmarks
mlperf audit
mlperf audit --policy public

# Run bundled validation presets; these execute workloads and grade artifacts
mlperf validate smoke
mlperf validate coverage
mlperf validate max
mlperf validate release --output-dir submissions/validation

# Grade a submissions directory
mlperf grade submissions --output submissions/grade.json

# Run tests
pytest

The installed command is mlperf; this package defaults that command to the mlperf-edu benchmark suite. mlperf-edu may also be installed as a compatibility alias, but public instructions should use mlperf.

The benchmark profiles are min, max, and pro: MIN checks, MAX benchmarks, and PRO explores. Validation presets are named by intent so they do not collide with profile names: smoke runs doctor plus the default fast path at min scale, coverage runs every workload at min scale, max runs every workload at max scale, and release runs every workload at both min and max scale. Each validation writes run reports and grading summaries under stable directories such as submissions/validation/min-default, plus one top-level validation JSON/HTML/CSV summary.

Public result status is separate from profile and suite:

score-bearing workloads have real-data quality targets and can carry public quality-plus-performance results.
performance-bearing workloads have standardized functional checks and can carry public performance results.
systems-only workloads are still useful for architecture, kernel, backend, quantization, pruning, LoRA, distributed, or agent studies, but should not be advertised as public scores.

Run mlperf audit for the local development contract that students and instructors should see as clean. Run mlperf audit --policy public for endorsement/release review; that stricter policy fails on unresolved public-release warnings such as dataset terms that require maintainer or MLCommons approval. mlperf grade and mlperf validate remain local execution and quality checks, so they do not fail or warn on public-release policy decisions.

Observed local validation runtimes on an Apple Silicon laptop are:

Validation	What it checks	Observed Runtime
smoke	Doctor plus default 12-workload min run	11.9 s
coverage	All 30 registered min manifests	24.8 s
max	All 30 registered max manifests	95.3 s
release	All 30 min plus all 30 max manifests	115.9 s

The validation summaries persist duration_seconds at the top level and for each suite item. They also embed a workload breakdown in the validation JSON/HTML and write mlperf_validate_workloads_<preset>_<timestamp>.csv, so instructors can track local machine drift, identify bottleneck workloads, and decide whether a run belongs in setup, lab, or release validation. Validation summaries also carry local grading status, and workload CSV rows preserve canonical selectors, dataset terms, and shared checkpoint dependencies.
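
One way to use those durations is to rank suite items by time and see which ones dominate a validation run. The sketch below works on an in-memory dictionary. Only the `duration_seconds` field name comes from this README. The `items` and `workload` keys, and all numbers, are made up for illustration and may not match the real summary schema.

```python
# Hypothetical summary shape; only "duration_seconds" is named in the README.
EXAMPLE_SUMMARY = {
    "duration_seconds": 10.0,
    "items": [
        {"workload": "a", "duration_seconds": 6.0},
        {"workload": "b", "duration_seconds": 3.0},
        {"workload": "c", "duration_seconds": 1.0},
    ],
}


def slowest_items(summary, top_n=3):
    """Return (workload, seconds, share of total) for the slowest items."""
    total = summary["duration_seconds"]
    ranked = sorted(
        summary.get("items", []),
        key=lambda item: item["duration_seconds"],
        reverse=True,
    )
    return [
        (item["workload"], item["duration_seconds"], item["duration_seconds"] / total)
        for item in ranked[:top_n]
    ]


for name, seconds, share in slowest_items(EXAMPLE_SUMMARY, top_n=2):
    print(f"{name}: {seconds:.1f} s ({share:.0%} of the run)")
```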

The default min profile path is the fast 12-workload starter run used for setup confidence. mlperf validate coverage runs every registered workload at min scale. All runs automatically write a timestamped JSON report plus paired HTML and CSV summaries. Use --open-report to open the HTML report in the default browser, or convert any workload or suite report with mlperf report <path-or-run-directory> --format json|csv|html. Use --power to add aggregate estimated watts and joules to the reports without requiring privileged hardware counters. Verified manifests can be bundled with mlperf package, and mlperf grade scans a submissions directory with the same provenance verifier used by the standalone verify command.

Every registered workload now has a min runner. The goal of min is functional confidence: imports, model construction, a tiny deterministic forward/train/decode loop, report export, provenance, and grading should all work locally before instructors or researchers scale to max and pro.

The max profile path is the comparable full-suite run intended for assignments, artifact evaluation, and paper baselines. It currently runs all 30 registered workloads at max scale. Each score-bearing run emits a report, checkpoint where applicable, and verifiable provenance manifest. Systems-only max workloads use deterministic micro-shards until their real-data quality checks are promoted. The pro profile has a conservative default path that repeats the matching max runner, records sub-run evidence hashes, and can be scaled with MLPERF_EDU_PRO_REPETITIONS. Larger pro-only sweeps are being wired behind the contract in SPEC.md.
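
The default pro path described above (repeat a runner, hash each sub-run) could look roughly like the sketch below. It is not the harness's code. `run_pro`, `evidence_hash`, the record layout, and the fallback of three repetitions are all assumptions. Only the MLPERF_EDU_PRO_REPETITIONS variable name comes from this README.

```python
import hashlib
import json
import os


def evidence_hash(result):
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_pro(run_once, default_repetitions=3):
    """Repeat run_once and record a hash per sub-run.

    default_repetitions is an arbitrary placeholder; the real default is not
    stated in the README.
    """
    repetitions = int(os.environ.get("MLPERF_EDU_PRO_REPETITIONS", default_repetitions))
    sub_runs = []
    for index in range(repetitions):
        result = run_once(index)
        sub_runs.append({"index": index, "result": result, "sha256": evidence_hash(result)})
    return sub_runs


if __name__ == "__main__":
    # Stand-in runner that returns made-up metrics.
    for record in run_pro(lambda i: {"loss": 2.0 + 0.01 * i}):
        print(record["index"], record["sha256"][:12])
```

Hashing a canonical encoding rather than the raw file bytes keeps the digest stable when the same metrics are serialized with a different key order.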

The SLM suite is exposed as smollm2-chat-inference with variants such as baseline, quantized-int8, batched-b4, and long-context. The public CLI, CSV, JSON, and HTML reports show those canonical selectors; internal runner IDs remain metadata for compatibility and debugging. The min profiles use a deterministic tiny local model for setup validation; max defaults to HuggingFaceTB/SmolLM2-135M-Instruct; --model can select aliases such as smollm2-135m, qwen2.5-0.5b, or qwen3-0.6b.

The vision suite now has max coverage for ResNet-18, MobileNetV2, and the MobileNet compression-composition workload. Tensor-shard overrides keep tests fast, while normal score-bearing runs use MIT-licensed Fashion-MNIST.

The recommender suite now has max coverage for both DLRM memory-system variants: micro-dlrm-dram-train uses real MovieLens data with a scalable hashed virtual embedding table, and micro-dlrm-distributed validates localhost Gloo DDP against a gradient-accumulation baseline.
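
The idea behind a hashed embedding table is that an arbitrarily large ID space is folded onto a fixed number of physical rows. The sketch below shows that general hashing trick in plain Python. How micro-dlrm-dram-train implements its table is not described here, so `bucket_for` and the bucket count are illustrative assumptions.

```python
import hashlib


def bucket_for(raw_id, num_buckets):
    """Map an arbitrary ID onto one of num_buckets physical embedding rows.

    A stable hash (BLAKE2b) is used instead of Python's built-in hash() so the
    mapping is identical across processes and runs.
    """
    digest = hashlib.blake2b(str(raw_id).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


NUM_BUCKETS = 1024  # physical rows; the virtual ID space can be far larger

for user_id in (7, 123_456_789, 10**15):
    print(user_id, "->", bucket_for(user_id, NUM_BUCKETS))
```

Distinct IDs can collide in the same row, which is the accuracy cost paid for a table whose memory footprint no longer grows with the ID space.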

The tiny suite now has max smoke coverage for DS-CNN keyword spotting and visual wake-word models using synthetic micro-shards. These validate training, checkpointing, reports, and provenance without requiring Speech Commands, Wake Vision, or torchaudio during setup validation.

The agent suite now has max coverage for RAG retrieval/generation, iterative code generation, ReAct-style tool use, and structured tool calling. These runs use deterministic local PyTorch models and synthetic prompts so students can profile systems costs without external APIs.

Research-oriented workloads now have min and max coverage for MoE, diffusion, GNN, BERT, LSTM, RL, LoRA fine-tuning, fp32/fp16 NanoGPT decode, and speculative decoding. The current max path is a deterministic micro-shard systems measurement; real-data quality checks should replace those micro-shards one workload at a time.

Benchmark Suite

Note

Source of truth — registry/suites/... and mlperf list define the executable registry. workloads.yaml is a generated compatibility mirror. This table is a human-readable catalog of the major workload families.

Suite	Task	Model	Params	Dataset	Quality target
language	Training	NanoGPT	11.1M	TinyShakespeare from Project Gutenberg	Loss < 2.3
language	Optimization	Nano-MoE	17.4M	TinyShakespeare from Project Gutenberg	Loss < 0.05
recommender	Training	Micro-DLRM	23K	MovieLens-100K	Acc > 0.70
vision	Training	Micro-Diff.	2.0M	CIFAR-10	MSE < 0.002
graph	Training	Micro-GCN	5.6K	Cora	Acc > 0.78
language	Training	Micro-BERT	432K	SST-2	Acc > 0.78
timeseries	Training	Micro-LSTM	51K	ETTh1	MSE < 0.13
rl	Training	Micro-RL	17K	CartPole (local)	Reward > 195
slm	Decode	SmolLM2/Qwen	135M+	Local prompts	Generated tokens >= 8
slm	Quantized Decode	SmolLM2/Qwen int8	135M+	Local prompts	Generated tokens >= 8
slm	Batched Decode	SmolLM2/Qwen	135M+	Prompt batch	Generated tokens >= 8 per request
slm	Long Context	SmolLM2/Qwen	135M+	Expanded local prompt	Generated tokens >= 8
vision	Image Classification	ResNet-18	11.2M	Fashion-MNIST	Top1 > 75%
vision	Mobile	MobileNetV2	2.4M	Fashion-MNIST	Top1 > 70%
tiny	Keyword Spotting	DS-CNN	20K	Speech Commands v2	Top1 > 90%
tiny	Anomaly Detection	Autoencoder	0.3M	MNIST	MSE < 0.04
tiny	Person Detection	MicroNet	8.5K	Wake Vision	Acc > 85%
agent	RAG	NanoRAG	20.1M	ReAct Traces	Retrieval + generation
agent	CodeGen	NanoCodeGen	13.7M	MBPP (20 tasks)	pass@1 > 0.15
agent	ReAct	NanoReAct	13.7M	ReAct Traces	Trace acc > 0.60
agent	ToolCall	NanoToolCall	13.7M	ReAct Traces	JSON validity + dispatch

Most local teaching models are inspectable PyTorch modules. The SLM workload uses transformers to hydrate off-the-shelf models. Training times were originally measured on Apple M1 MPS and are being re-verified as the runnable harness stabilizes.
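
The threshold-style quality targets in the table can be checked with a small comparison table. This is a minimal sketch, not the compliance checker in scripts/. The metric key names (`loss`, `accuracy`, `mse`) and the `meets_target` helper are assumptions. The workload names and thresholds are copied from this README.

```python
import operator

COMPARATORS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}

# (metric key, comparator, threshold); metric key names are assumed.
TARGETS = {
    "nanogpt-train": ("loss", "<", 2.3),
    "micro-dlrm-train": ("accuracy", ">", 0.70),
    "anomaly-ae-train": ("mse", "<", 0.04),
}


def meets_target(workload, metrics):
    """Return True when the reported metric satisfies the workload's target."""
    key, symbol, threshold = TARGETS[workload]
    if key not in metrics:
        return False  # a missing metric never passes
    return COMPARATORS[symbol](metrics[key], threshold)


# Made-up metric values for illustration.
print(meets_target("nanogpt-train", {"loss": 2.1}))
print(meets_target("micro-dlrm-train", {"accuracy": 0.65}))
```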

Project Structure

mlperf-edu/
├── paper/                      # Publication source (LaTeX)
│   ├── paper.tex               # Main paper
│   ├── refs.bib                # Bibliography
│   └── figures/                # TikZ + pgfplots figures
├── reference/                  # Reference implementations
│   ├── cloud/                  # NanoGPT, MoE, DLRM, Diffusion, GNN, BERT, LSTM, RL, Agents
│   ├── edge/                   # ResNet-18, MobileNetV2  (fully local)
│   ├── tiny/                   # DS-CNN, Autoencoder, MicroNet
│   ├── dataset_factory.py      # Unified data loading (deterministic, seed=42)
│   └── agent_datasets.py       # MBPP + ReAct trace datasets
├── src/mlperf/                 # Core harness
│   ├── edu_cli.py              # mlperf CLI entry point
│   ├── loadgen.py              # LoadGen proxy (Offline/Server/SingleStream/MultiStream)
│   ├── power.py                # Power profiler (powermetrics / nvidia-smi)
│   └── sut.py                  # System Under Test interface
├── scripts/
│   └── compliance_checker.py   # Quality target validation
├── examples/                   # Student lab exercises
│   ├── lab1_optimization.py    # Systems optimization challenge
│   ├── lab2_inference_sut.py   # Inference SUT plugin
│   └── lab3_arch_comparison.py # Dense vs. sparse architectures
├── registry/                   # Native suite/workload/variant registry source
├── workloads.yaml              # Generated compatibility mirror
└── data/                       # Local datasets (TinyShakespeare, MBPP, etc.)

Lab Exercises

Lab 1: Systems Optimization Challenge

Students receive a "broken baseline" ResNet-18 (batch_size=8, no workers, no schedule, no augmentation) and must reach >50% accuracy within a 30-second wall-clock budget.

python examples/lab1_optimization.py
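
A wall-clock budget like the one in Lab 1 is usually enforced by checking a monotonic clock between training steps. The sketch below illustrates that pattern and is not the lab's code. `train_within_budget` and its `step_fn` callback are hypothetical names.

```python
import time


def train_within_budget(step_fn, budget_s, clock=time.monotonic):
    """Call step_fn(step_index) until budget_s seconds have elapsed.

    The budget is checked before each step, so the final step can overrun
    slightly; keep individual steps short relative to the budget.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_s:
        step_fn(steps)
        steps += 1
    return steps, clock() - start


if __name__ == "__main__":
    # Stand-in for one optimizer step.
    done, elapsed = train_within_budget(lambda i: time.sleep(0.01), budget_s=0.1)
    print(f"{done} steps in {elapsed:.2f} s")
```

Because every second of data loading counts against the same budget, choices such as batch size and loader workers affect how many steps fit, which is the point of the exercise.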

Lab 2: Inference Latency Optimization

Students implement a System Under Test (SUT) plugin for NanoGPT inference, then optimize it with a KV cache, torch.compile(), or FP16 while the LoadGen proxy measures p90 latency.

mlperf run --workload nanogpt-inference --variant decode --profile min

Lab 3: Architecture Comparison

Students train NanoGPT (dense) and Nano-MoE (sparse) side-by-side, comparing convergence, memory, and throughput.

python examples/lab3_arch_comparison.py

How It Works

Students are "submitters": they modify model code, training loops, or inference pipelines, and the harness measures and checks the result in five stages:

Train → Quality target validation (loss/accuracy thresholds)
Infer → LoadGen proxy generates Poisson/bulk arrivals, measures latency percentiles
Profile → Power measurement via powermetrics (macOS) or nvidia-smi (Linux)
Submit → JSON artifact with hardware fingerprint, metrics, and SHA-256 hash
Check → Compliance checker validates quality, parameter counts, convergence bounds
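
The Infer stage above rests on two ideas: Poisson arrivals, where gaps between requests are exponentially distributed, and latency percentiles. A minimal sketch of both follows. It is not the code in loadgen.py. `poisson_arrivals` and `percentile` are illustrative names, and the nearest-rank percentile definition is one common choice among several.

```python
import math
import random


def poisson_arrivals(rate_per_s, count, seed=42):
    """Arrival timestamps for a Poisson process: exponential gaps, fixed seed."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        now += rng.expovariate(rate_per_s)
        times.append(now)
    return times


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


arrivals = poisson_arrivals(rate_per_s=20.0, count=5)
print([round(t, 3) for t in arrivals])

# Made-up latencies in milliseconds.
latencies_ms = [12, 15, 11, 30, 14, 13, 16, 45, 12, 14]
print("p90 latency:", percentile(latencies_ms, 90), "ms")
```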

Dataset Strategy

Strategy	Datasets	Download
Bundled/local	Prompt fixtures, synthetic micro-shards, local traces, and small zero-network setup assets	0 B
Public upstream	Project Gutenberg TinyShakespeare recipe, Fashion-MNIST, MNIST, Hugging Face model weights, and other public assets recorded in dossiers	On fetch or first run
Restricted/review	MovieLens-100K, Speech Commands v2, Wake Vision, optional CIFAR experiments, and any dataset whose public redistribution policy still needs owner or MLCommons review	Policy-dependent

Each asset has source, license, public-release status, cache behavior, and next-step metadata. Synthetic or micro-sharded data is labeled and is not treated as a public score.

Requirements

Python 3.10+
uv for the recommended install path
PyTorch 2.0+
torchvision (for Fashion-MNIST, MNIST, and optional CIFAR experiments)
transformers (for SLM workloads)
Optional: torchaudio for full Speech Commands experiments

uv sync --extra dev
uv run mlperf doctor

On Apple Silicon, set PYTORCH_ENABLE_MPS_FALLBACK=1 so that operators not yet implemented for the MPS backend fall back to the CPU instead of raising an error.
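
In a script, the variable can be set before PyTorch is imported, followed by a simple device choice. This is a sketch under assumptions. `pick_device` is a hypothetical helper, and preferring CUDA, then MPS, then CPU is this example's ordering rather than something the harness specifies. `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are existing PyTorch calls.

```python
import os

# Set the fallback flag before importing torch so it is in place when the
# MPS backend initializes. setdefault keeps any value the user already chose.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; nothing else to choose from
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print("selected device:", pick_device())
```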

Citation

@inproceedings{mlperfedu2026,
  title={{MLPerf EDU}: Bridging Industry Benchmarking and {ML} Systems Education},
  author={[Authors]},
  year={2026}
}

Built for Machine Learning Systems education.