Round 2 of the bib audit, covering the paper subprojects (mlsysim,
tinytorch, periodic-table, mlperf-edu) that the textbook-focused first
pass deferred. Same pattern as round 1: surname/year key prefixes that
did not match the entry's actual paper, plus several corrupt entries
from Crossref misidentification.
Renames:
- mlsysim/{docs,paper}: barrett2024 -> zheng2024sglang (SGLang paper,
Zheng is first author).
- mlsysim/paper: zhao2025 -> deepseek2025v3 (DeepSeek-V3 ISCA paper,
corporate author DeepSeek-AI).
- tinytorch: key499f5624 -> tanenbaum1987os (hash-fallback for
Tanenbaum OS textbook); fry1985 -> abelson1996sicp (SICP 2nd ed,
Fry is not in author list); wooster1982 -> papert1980mindstorms
(Mindstorms by Papert, Wooster not in author list); collins2018 ->
collins1989apprenticeship (Cognitive Apprenticeship paper is 1989).
- tinytorch + periodic-table: vaswani2025 -> vaswani2017attention
(Attention paper is 2017; entries had a corrupt publisher and bogus
DOI from Crossref misidentification).
Body fixes accompanying renames:
- tanenbaum1987os, abelson1996sicp, papert1980mindstorms: rebuilt as
@book entries (were @article with stale review/journal DOIs).
- vaswani2017attention: rebuilt with canonical NeurIPS 2017 metadata
(Curran Associates, vol 30, pp 5998-6008); dropped corrupt DOI.
Orphan deletions:
- tinytorch keybe9561f4 (hash-fallback, no cite sites).
- mlperf-edu vaswani2017attention (orphan).
21 cite-site updates across 4 paper subprojects. bib_lint reports 0
errors across all 5 modified bibs.
> [!WARNING]
> 🚧 **Under construction**
> This tree is not polished end-to-end yet: APIs, CLI flags, workload manifests, and documentation are still being wired for classroom use. Do not rely on it for production benchmarking — expect breaking changes until we publish a stable "1.0" teaching release.

> [!NOTE]
> 📌 **Early work (2026)**
> MLPerf EDU is being developed in public alongside the 2026 MLSysBook ecosystem. Harness scripts, compliance checks, and teaching notes will keep moving as we align workloads with the core curriculum.

**Feedback** — GitHub issues or pull requests (especially if something in this README is wrong or outdated).
# MLPerf EDU 🎓

A 16-workload pedagogical ML systems benchmark suite aligned with MLCommons MLPerf.

MLPerf EDU brings industry-standard ML benchmarking to the classroom. Every model is a self-contained, white-box PyTorch `nn.Module` — no `torchvision.models`, no HuggingFace model cards, no opaque C++ bindings. Students read, modify, and optimize every layer.

📄 **Paper:** See `paper/paper.tex` — "MLPerf EDU: Bridging Industry Benchmarking and ML Systems Education"
## Quick Start

```bash
# Clone and install
git clone https://github.com/harvard-edge/cs249r_book.git
cd cs249r_book/mlperf-edu
pip install -e .

# Train a single workload (25 epochs, ~89 seconds)
mlperf cloud --task nanogpt-12m

# Train ALL 16 workloads (~8 minutes)
mlperf train --all

# Run inference with a student SUT plugin
mlperf cloud --task nanogpt-12m \
    --sut my_optimized_sut.py \
    --scenario Server --division closed

# Check compliance
python scripts/compliance_checker.py \
    --workload nanogpt --log results.json
```
## Benchmark Suite (16 Workloads)

> [!NOTE]
> **Source of truth** — Row counts and targets stay in sync with `workloads.yaml`. When you change a workload, update the YAML; this table is regenerated from it (see the sketch after the table).
| Division | Task | Model | Params | Dataset | Quality target |
|---|---|---|---|---|---|
| Cloud | Language | NanoGPT | 11.1M | TinyShakespeare (char) | Loss < 2.3 |
| Cloud | Sparse MoE | Nano-MoE | 17.4M | TinyShakespeare (char) | Loss < 0.05 |
| Cloud | Rec. | Micro-DLRM | 23K | MovieLens-100K | Acc > 0.70 |
| Cloud | Generation | Micro-Diff. | 2.0M | CIFAR-10 | MSE < 0.002 |
| Cloud | Graph | Micro-GCN | 5.6K | Cora | Acc > 0.78 |
| Cloud | Text Cls. | Micro-BERT | 432K | SST-2 | Acc > 0.78 |
| Cloud | Time Series | Micro-LSTM | 51K | ETTh1 | MSE < 0.13 |
| Cloud | RL | Micro-RL | 17K | CartPole (local) | Reward > 195 |
| Edge | Img. Cls. | ResNet-18 | 11.2M | CIFAR-100 | Top1 > 36% |
| Edge | Mobile | MobileNetV2 | 2.4M | CIFAR-100 | Top1 > 40% |
| Tiny | KWS | DS-CNN | 20K | Speech Commands v2 | Top1 > 90% |
| Tiny | Anomaly | Autoencoder | 0.3M | MNIST | MSE < 0.04 |
| Tiny | Person Det. | MicroNet | 8.5K | Wake Vision | Acc > 85% |
| Agent | RAG | NanoRAG | 20.1M | ReAct Traces | Retr.+Gen |
| Agent | CodeGen | NanoCodeGen | 13.7M | MBPP (20 tasks) | pass@1 > 0.15 |
| Agent | ReAct | NanoReAct | 13.7M | ReAct Traces | Trace acc > 0.60 |
All models are pure PyTorch. All training times measured on Apple M1 MPS. Total supervised suite: ~9 minutes.
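
As a rough illustration, regenerating the table from the YAML could look like the sketch below. The field names (`division`, `task`, and so on) are assumptions about the schema, not its documented layout; check `workloads.yaml` itself before relying on them.

```python
# Sketch: rebuild the Markdown table from workloads.yaml.
# Field names are guesses at the schema, not the real contract.
import yaml  # pip install pyyaml

with open("workloads.yaml") as f:
    workloads = yaml.safe_load(f)

print("| Division | Task | Model | Params | Dataset | Quality target |")
print("|---|---|---|---|---|---|")
for w in workloads:
    print(f"| {w['division']} | {w['task']} | {w['model']} "
          f"| {w['params']} | {w['dataset']} | {w['target']} |")
```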
## Project Structure

```text
mlperf-edu/
├── paper/                      # Publication source (LaTeX)
│   ├── paper.tex               # Main paper
│   ├── refs.bib                # Bibliography
│   └── figures/                # TikZ + pgfplots figures
├── reference/                  # Reference implementations
│   ├── cloud/                  # NanoGPT, MoE, DLRM, Diffusion, GNN, BERT, LSTM, RL, Agents
│   ├── edge/                   # ResNet-18, MobileNetV2 (fully local)
│   ├── tiny/                   # DS-CNN, Autoencoder, MicroNet
│   ├── dataset_factory.py      # Unified data loading (deterministic, seed=42)
│   └── agent_datasets.py       # MBPP + ReAct trace datasets
├── src/mlperf/                 # Core harness
│   ├── cli.py                  # CLI entry point
│   ├── loadgen.py              # LoadGen proxy (Offline/Server/SingleStream/MultiStream)
│   ├── power.py                # Power profiler (powermetrics / nvidia-smi)
│   └── sut.py                  # System Under Test interface
├── scripts/
│   └── compliance_checker.py   # Quality target validation
├── examples/                   # Student lab exercises
│   ├── lab1_optimization.py    # Systems optimization challenge
│   ├── lab2_inference_sut.py   # Inference SUT plugin
│   └── lab3_arch_comparison.py # Dense vs. sparse architectures
├── workloads.yaml              # Workload registry (single source of truth)
└── data/                       # Local datasets (TinyShakespeare, MBPP, etc.)
```
## Lab Exercises

### Lab 1: Systems Optimization Challenge

Students receive a "broken baseline" ResNet-18 (`batch_size=8`, no workers, no schedule, no augmentation) and must reach >50% accuracy within a 30-second wall-clock budget.

```bash
python examples/lab1_optimization.py
```
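
Finding the fixes is the exercise, but they usually amount to something like the following sketch. Hyperparameters are illustrative, and the `torchvision.models` ResNet-18 here is only a stand-in: the lab ships its own white-box model.

```python
# Sketch of typical repairs to the broken baseline: augmentation,
# a real batch size, parallel data loading, and an LR schedule.
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),       # was: no augmentation
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR100(
    "data", train=True, download=True, transform=train_tf)

loader = DataLoader(
    train_set,
    batch_size=128,                    # was: batch_size=8
    shuffle=True,
    num_workers=4,                     # was: no workers
    pin_memory=True,
)

model = torchvision.models.resnet18(num_classes=100)  # stand-in only
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(          # was: no schedule
    opt, max_lr=0.1, total_steps=len(loader))
```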
### Lab 2: Inference Latency Optimization

Students implement a System Under Test (SUT) plugin for NanoGPT inference. Optimize with a KV-cache, `torch.compile()`, or FP16 while the LoadGen measures p90 latency.

```bash
mlperf cloud --task nanogpt-12m --sut examples/lab2_inference_sut.py --scenario SingleStream
```
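
The authoritative plugin contract is whatever `src/mlperf/sut.py` defines; as a hedged sketch of the shape such a plugin might take (class and method names here are assumptions, not the harness's actual interface):

```python
# Hypothetical SUT plugin sketch; the real interface lives in
# src/mlperf/sut.py and may use different names and hooks.
import torch

class StudentSUT:
    def __init__(self, model: torch.nn.Module):
        # torch.compile is one of the suggested optimizations.
        self.model = torch.compile(model.eval())

    @torch.no_grad()
    def issue_query(self, batch: torch.Tensor) -> torch.Tensor:
        # Reduced-precision autocast is another lever; a KV-cache
        # would live inside the model's forward pass.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            return self.model(batch)
```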
### Lab 3: Architecture Comparison

Students train NanoGPT (dense) and Nano-MoE (sparse) side by side, comparing convergence, memory, and throughput.

```bash
python examples/lab3_arch_comparison.py
```
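
A minimal throughput probe for that comparison might look like this sketch; `model` and `batch` are placeholders, and model construction is omitted:

```python
# Sketch: compare tokens/sec between the dense and sparse models.
import time
import torch

@torch.no_grad()
def tokens_per_sec(model: torch.nn.Module, batch: torch.Tensor,
                   steps: int = 20) -> float:
    model.eval()
    start = time.perf_counter()
    for _ in range(steps):
        model(batch)
    return steps * batch.numel() / (time.perf_counter() - start)
```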
## How It Works

Students are "submitters." They modify model code, training loops, or inference pipelines. The harness measures everything:

- **Train** → Quality target validation (loss/accuracy thresholds)
- **Infer** → LoadGen proxy generates Poisson/bulk arrivals, measures latency percentiles
- **Profile** → Power measurement via `powermetrics` (macOS) or `nvidia-smi` (Linux)
- **Submit** → JSON artifact with hardware fingerprint, metrics, and SHA-256 hash (sketched below)
- **Check** → Compliance checker validates quality, parameter counts, convergence bounds
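
The artifact's exact schema is set by the harness; the sketch below only illustrates the hash-over-JSON idea, with made-up field names and values:

```python
# Sketch of a tamper-evident submission artifact (fields illustrative).
import hashlib
import json
import platform

artifact = {
    "hardware": platform.platform(),   # stand-in for the fingerprint
    "workload": "nanogpt-12m",
    "metrics": {"p90_latency_ms": 12.3, "final_loss": 2.21},
}
payload = json.dumps(artifact, sort_keys=True).encode()
artifact["sha256"] = hashlib.sha256(payload).hexdigest()
print(json.dumps(artifact, indent=2))
```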
## Dataset Strategy
| Strategy | Datasets | Download |
|---|---|---|
| Shipped with repo | TinyShakespeare, MBPP, ReAct Traces | 0 B |
| Deterministic synthetic | GCN, BERT, LSTM, DLRM, CartPole, RL | 0 B |
| Auto-download | CIFAR-10/100, MNIST, Speech Commands v2, Wake Vision | On first run |
8 of 13 datasets require zero network access. All use seed=42 for reproducibility.
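
In PyTorch terms, the determinism recipe usually amounts to something like the following sketch; `reference/dataset_factory.py` holds the suite's actual setup.

```python
# Sketch of seed=42 determinism; see reference/dataset_factory.py
# for what the suite actually does.
import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds CUDA/MPS generators
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything()
```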
## Requirements

- Python 3.9+
- PyTorch 2.0+
- `torchvision` (for CIFAR/MNIST)
- `torchaudio` (for Speech Commands)

```bash
pip install torch torchvision torchaudio
pip install -e .
```

For Apple Silicon: set `PYTORCH_ENABLE_MPS_FALLBACK=1` for full MPS compatibility.
## Citation

```bibtex
@inproceedings{mlperfedu2026,
  title={{MLPerf EDU}: Bridging Industry Benchmarking and {ML} Systems Education},
  author={[Authors]},
  year={2026}
}
```
Built for Machine Learning Systems education.