mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-06 09:38:33 -05:00

Note

📌 Early release (2026)

MLSys·im shipped with the 2026 MLSysBook refresh. The modeling platform, APIs, and lab integrations are still changing as we harden the simulator and teaching workflows.

Feedback is welcome through GitHub issues or pull requests.

🚀 MLSys·im: The Modeling Platform

The physics-grounded analytical simulator powering the Machine Learning Systems ecosystem.
Provides a unified "Single Source of Truth" (SSoT) for modeling systems from sub-watt microcontrollers to exaflop-scale global fleets.

🏗 The 5-Layer Analytical Stack

mlsysim implements a "Progressive Lowering" architecture, separating high-level workloads from the physical infrastructure that executes them.

| Layer | Domain | Key Components |
| --- | --- | --- |
| Layer A | Workload Representation (`mlsysim.models`) | FLOPs, parameters, and intensity, e.g., Llama3_70B, ResNet50 |
| Layer B | Hardware Registry (`mlsysim.hardware`) | Concrete specs for real-world silicon, e.g., H100, TPUv5p, Jetson |
| Layer C | Infrastructure (`mlsysim.infra`) | Grid profiles and datacenter sustainability, e.g., PUE, Carbon Intensity, WUE |
| Layer D | Systems & Topology (`mlsysim.systems`) | Fleet configurations and network fabrics, e.g., Doorbell, AutoDrive Scenarios |
| Layer E | Execution & Resolvers (`mlsysim.core.solver`) | The 3-tier math engine: Models, Solvers, and Optimizers (design space search) |
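
One way to picture how the layers compose is a roofline-style bound: Layer A supplies a workload's FLOPs and bytes moved, Layer B supplies peak compute and memory bandwidth, and a Layer E resolver takes the slower of the two. The sketch below is a minimal illustration under that assumption. The class names, fields, and numbers are hypothetical and are not mlsysim's actual API or registry values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Layer A stand-in: what the job needs (hypothetical fields)."""
    flops: float        # total floating-point operations
    bytes_moved: float  # total bytes read from or written to memory


@dataclass(frozen=True)
class Hardware:
    """Layer B stand-in: what the silicon offers (hypothetical fields)."""
    peak_flops: float     # FLOP/s
    mem_bandwidth: float  # bytes/s


def resolve_time(workload: Workload, hardware: Hardware) -> dict:
    """Layer E stand-in: roofline lower bound on execution time."""
    compute_s = workload.flops / hardware.peak_flops
    memory_s = workload.bytes_moved / hardware.mem_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return {"time_s": max(compute_s, memory_s), "bound": bound}


# Illustrative numbers only, not taken from the mlsysim registry.
job = Workload(flops=2e12, bytes_moved=16e9)
chip = Hardware(peak_flops=1e15, mem_bandwidth=2e12)
result = resolve_time(job, chip)
print(result)
```

Keeping the workload and hardware descriptions in separate types is what lets the same job be re-resolved against different silicon without touching the workload definition.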

🚀 Quick Usage: The Agent-Ready CLI

mlsysim is a first-principles analytical calculator for ML systems. It provides a terminal UI for humans and a strict JSON API for CI/CD pipelines and AI agents.

Accuracy note: mlsysim predictions are typically within 2–5× of measured performance for well-characterized workloads. For production capacity planning, always validate with benchmarks. This tool formalizes the back-of-envelope math that senior engineers do intuitively — it is not a substitute for profiling or load testing.
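
Read literally, a 2–5× accuracy band means a prediction brackets the measured value only to within a multiplicative factor. The helper below is a minimal sketch of that arithmetic, assuming the error is symmetric in ratio terms (the note above does not say whether it is). `plausible_range` is a hypothetical name, not part of mlsysim.

```python
from typing import Tuple


def plausible_range(predicted: float, factor: float = 5.0) -> Tuple[float, float]:
    """Return the (low, high) interval implied by a multiplicative error factor."""
    if predicted <= 0 or factor < 1:
        raise ValueError("predicted must be positive and factor must be >= 1")
    return predicted / factor, predicted * factor


# A hypothetical 40 ms latency prediction with the pessimistic 5x factor.
low, high = plausible_range(40.0, factor=5.0)
print(f"predicted 40.0 ms, measured value plausibly in [{low:.1f}, {high:.1f}] ms")
```

An interval this wide is the reason the note recommends validating with benchmarks before committing to a capacity plan.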

1. Explore the Registry (The Zoo)

Discover built-in hardware, models, and infrastructure without reading source code:

mlsysim zoo hardware
mlsysim zoo models

2. Quick Evaluation (CLI Flags)

Evaluate the physics of a workload on a specific hardware node instantly:

mlsysim eval Llama3_8B H100 --batch-size 32

3. Deep Simulation (Infrastructure as Code)

Define your entire cluster and SLA constraints in a declarative mlsys.yaml file:

# example_cluster.yaml
version: "1.0"
name: "Llama-3 70B training audit"
workload:
  name: "Llama3_70B"
  batch_size: 4096
hardware:
  name: "H100"
  nodes: 64
ops:
  region: "Quebec"
  duration_days: 14.0
constraints:
  assert:
    - metric: "performance.latency"
      max: 50.0

Then compile and evaluate the 3-lens scorecard (Feasibility, Performance, Macro):

mlsysim eval example_cluster.yaml
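
To make the `constraints.assert` section of the YAML concrete, here is a minimal sketch of how a `max` bound on a dotted metric name could be checked. It assumes that `performance.latency` addresses a nested field in the evaluation output. The helper names and the sample metrics are hypothetical, and the real evaluator's semantics and units may differ.

```python
def lookup(metrics: dict, dotted_path: str) -> float:
    """Resolve a dotted name such as 'performance.latency' in nested dicts."""
    value = metrics
    for key in dotted_path.split("."):
        value = value[key]
    return value


def violated_asserts(config: dict, metrics: dict) -> list:
    """Return the assert entries whose 'max' bound is exceeded."""
    violations = []
    for rule in config.get("constraints", {}).get("assert", []):
        observed = lookup(metrics, rule["metric"])
        if "max" in rule and observed > rule["max"]:
            violations.append({**rule, "observed": observed})
    return violations


# Dict mirroring the constraints section of the YAML example.
config = {"constraints": {"assert": [{"metric": "performance.latency", "max": 50.0}]}}

# Hypothetical evaluation output; real metric names and units may differ.
metrics = {"performance": {"latency": 62.5}}

violations = violated_asserts(config, metrics)
print(violations)
```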

4. CI/CD & Agentic Automation

Every command supports strict, schema-validated JSON output. If an assert constraint is violated, the CLI returns a semantic Exit Code 3.

# Export the JSON Schema for your IDE or AI Agent
mlsysim schema > schema.json

# Run an evaluation in a CI pipeline
tco=$(mlsysim --output json eval example_cluster.yaml | jq .m_tco_usd)
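
The same pipeline step can be driven from Python. The sketch below wraps the CLI call shown above and branches on the documented exit code 3. Treating every other nonzero code as a generic error, and parsing stdout only on success, are assumptions. The sample payload value is invented; only the `m_tco_usd` key comes from the example above.

```python
import json
import subprocess

CONSTRAINT_VIOLATION = 3  # exit code documented for a failed assert


def interpret(returncode: int, stdout: str) -> dict:
    """Classify a finished CLI run from its exit code and JSON output."""
    if returncode == CONSTRAINT_VIOLATION:
        return {"status": "constraint_violated", "report": None}
    if returncode != 0:
        # Assumption: other nonzero codes are treated as generic failures.
        return {"status": "error", "report": None}
    return {"status": "ok", "report": json.loads(stdout)}


def run_eval(config_path: str) -> dict:
    """Invoke the CLI with the same arguments as the shell example."""
    proc = subprocess.run(
        ["mlsysim", "--output", "json", "eval", config_path],
        capture_output=True,
        text=True,
    )
    return interpret(proc.returncode, proc.stdout)


# Offline demonstration with a made-up payload (no CLI call is made here).
sample = interpret(0, '{"m_tco_usd": 1234.5}')
print(sample["status"], sample["report"]["m_tco_usd"])
print(interpret(3, "")["status"])
```

Separating `interpret` from `run_eval` keeps the exit-code logic testable without mlsysim installed.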

5. Design Space Search (Optimizers)

Use the Tier 3 Engineering Engine to automatically find the optimal configuration:

mlsysim optimize parallelism example_cluster.yaml
mlsysim optimize placement example_cluster.yaml --carbon-tax 150
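
As intuition for what a carbon-aware placement search trades off, the sketch below brute-forces the cheapest region once a carbon charge is added to the energy bill. It is not the optimizer's actual algorithm. It assumes the tax is expressed in USD per tonne of CO2 (the README does not state the unit), and the region profiles are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    usd_per_kwh: float    # electricity price
    g_co2_per_kwh: float  # grid carbon intensity
    pue: float            # power usage effectiveness


def total_cost(region: Region, it_energy_kwh: float, carbon_tax: float) -> float:
    """Energy bill plus carbon charge (tax assumed to be USD per tonne CO2)."""
    facility_kwh = it_energy_kwh * region.pue
    tonnes_co2 = facility_kwh * region.g_co2_per_kwh / 1e6
    return facility_kwh * region.usd_per_kwh + carbon_tax * tonnes_co2


def best_region(regions, it_energy_kwh: float, carbon_tax: float) -> Region:
    """Exhaustive search: pick the region with the lowest total cost."""
    return min(regions, key=lambda r: total_cost(r, it_energy_kwh, carbon_tax))


# Invented grid profiles for illustration; not mlsysim registry data.
regions = [
    Region("low_carbon_grid", usd_per_kwh=0.08, g_co2_per_kwh=30.0, pue=1.2),
    Region("cheap_fossil_grid", usd_per_kwh=0.06, g_co2_per_kwh=600.0, pue=1.2),
]

it_energy_kwh = 100_000.0
print(best_region(regions, it_energy_kwh, carbon_tax=0.0).name)
print(best_region(regions, it_energy_kwh, carbon_tax=150.0).name)
```

With these made-up numbers the cheaper but dirtier grid wins at zero tax, and the cleaner grid wins once the charge is applied, which is the kind of flip a `--carbon-tax` sweep is meant to expose.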

🛡 Stability & Integrity

Because this core powers a printed textbook, we enforce strict Invariant Verification. Every physical constant is traceable to a primary source (datasheet or paper), and dimensional integrity is enforced via pint.
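
To show what dimensional integrity buys, the sketch below tracks dimensions by hand in plain Python. It is a deliberately tiny stand-in for the idea, not pint's API and not how mlsysim is implemented. The `Quantity` class and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A value tagged with exponents of (byte, second) base dimensions."""
    value: float
    dims: tuple  # (byte_exponent, second_exponent)

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Dividing quantities subtracts dimension exponents.
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)

    def __add__(self, other: "Quantity") -> "Quantity":
        # Adding is only meaningful when dimensions match exactly.
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)


BYTES = (1, 0)
SECONDS = (0, 1)
BYTES_PER_SECOND = (1, -1)

weights = Quantity(16e9, BYTES)               # illustrative size
bandwidth = Quantity(2e12, BYTES_PER_SECOND)  # illustrative bandwidth

# bytes / (bytes/second) must come out in seconds.
load_time = weights / bandwidth
print(load_time.dims == SECONDS, load_time.value)

try:
    weights + bandwidth  # bytes + bytes/second is meaningless
    mixed_units_rejected = False
except TypeError:
    mixed_units_rejected = True
print(mixed_units_rejected)
```

A units library generalizes this to arbitrary units and conversions, so a formula that mixes, say, gigabytes with gigabits fails loudly instead of producing a plausible-looking number.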

⚠️ What This Tool Does Not Model

MLSys·im is an analytical hardware calculator, not a production deployment simulator. The 22 walls model physical and economic constraints that bound ML system performance. Several critical production concerns are deliberately out of scope:

| Concern | Why it matters | Where to learn more |
| --- | --- | --- |
| Data drift / distribution shift | The #1 cause of production ML failures: model accuracy degrades silently as input distributions change | Sculley et al. (2015), "Hidden Technical Debt in ML Systems" |
| Model versioning & rollback | Production requires running multiple versions, A/B testing, and safe rollback | Huyen (2022), Designing Machine Learning Systems |
| Monitoring & observability | You cannot manage what you cannot measure: prediction distributions, latency percentiles, error rates | Google SRE Book (2016); Huyen (2022) |
| Feature store freshness | Stale features silently degrade real-time models (recommendations, fraud detection) | Uber Michelangelo (2017) |
| Software bugs & misconfigurations | Most outages are caused by software, not hardware | Barroso et al. (2018) |
| Human factors | Team velocity, on-call burden, and organizational alignment often dominate outcomes | Brooks (1975), The Mythical Man-Month |

Passing all 22 walls is necessary but not sufficient for a successful production deployment.

Students using this tool should understand that infrastructure physics (what mlsysim models) is one dimension of a multi-dimensional engineering challenge.

📖 How to Cite

If you use mlsysim in your research or teaching, please cite:

@software{mlsysim2026,
  author       = {Janapa Reddi, Vijay},
  title        = {{MLSys$\cdot$im}: First-Principles Infrastructure Modeling for Machine Learning Systems},
  year         = {2026},
  url          = {https://mlsysbook.ai/mlsysim},
  version      = {0.1.1},
  institution  = {Harvard University}
}

🛠 Installation

MLSys·im is designed to be highly modular. Install only what you need:

# Core physics engine only (fastest, smallest footprint)
pip install mlsysim

# The CLI and YAML support are included in the base package.
# The [cli] extra is retained as a backward-compatible no-op.
pip install "mlsysim[cli]"

# Install with dependencies for interactive labs (Marimo, Plotly)
pip install "mlsysim[labs]"

🐍 Python API Usage

The framework is just as powerful inside a Python script or Jupyter Notebook. The SystemEvaluator provides a clean, unified entry point for full-stack analysis:

import mlsysim

# 1. Define the scenario
model = mlsysim.Models.Language.Llama3_8B
hardware = mlsysim.Hardware.Cloud.H100

# 2. Run the evaluation
evaluation = mlsysim.SystemEvaluator.evaluate(
    scenario_name="Llama-3 8B on H100",
    model_obj=model,
    hardware_obj=hardware,
    batch_size=32,
    precision="fp16",
    efficiency=0.45
)

# 3. View the beautifully formatted scorecard
print(evaluation.scorecard())

Efficiency Parameter Guide

The efficiency parameter (0.0–1.0) captures the gap between peak hardware performance and what your software stack actually achieves. Use these guidelines:

| Scenario | Efficiency | Rationale |
| --- | --- | --- |
| Training (Megatron-LM, large Transformer) | 0.40–0.55 | Well-optimized GEMM + FlashAttention |
| Training (PyTorch eager, small model) | 0.08–0.15 | Kernel launch overhead dominates |
| Inference decode, batch=1 | 0.01–0.05 | Memory-bound; compute nearly idle |
| Inference decode, batch=32+ | 0.15–0.35 | Batch amortizes weight loading |
| Inference prefill, long context | 0.30–0.50 | Compute-bound GEMM + attention |
| TinyML (TFLite Micro on ESP32) | 0.05–0.15 | Interpreter overhead, no tensor cores |
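
The ranges in this guide translate directly into a spread of predicted times. The sketch below assumes efficiency simply scales peak FLOP/s for a compute-limited step, which is one plausible reading of the parameter and not a statement of how mlsysim applies it internally. The dictionary keys, function name, and workload numbers are hypothetical.

```python
# (low, high) efficiency per scenario, copied from the guide's ranges.
EFFICIENCY_RANGES = {
    "training_megatron_large": (0.40, 0.55),
    "training_eager_small": (0.08, 0.15),
    "decode_batch_1": (0.01, 0.05),
    "decode_batch_32_plus": (0.15, 0.35),
    "prefill_long_context": (0.30, 0.50),
    "tinyml_esp32": (0.05, 0.15),
}


def compute_time_range(flops: float, peak_flops: float, scenario: str):
    """Return (best_case_s, worst_case_s) for a compute-limited step."""
    low, high = EFFICIENCY_RANGES[scenario]
    # Higher efficiency means more achieved FLOP/s, hence less time.
    return flops / (peak_flops * high), flops / (peak_flops * low)


# Hypothetical workload and device: 1e15 FLOPs on a chip with 1e15 FLOP/s peak.
best_s, worst_s = compute_time_range(1e15, 1e15, "training_megatron_large")
print(f"{best_s:.2f} s to {worst_s:.2f} s")
```

Running both ends of a range, instead of a single midpoint, shows how sensitive a conclusion is to the efficiency guess.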

Contributors

Thanks to these wonderful people for helping improve MLSys·im!

Legend: 🪲 Bug Hunter · ⚡ Code Warrior · 📚 Documentation Hero · 🎨 Design Artist · 🧠 Idea Generator · 🔎 Code Reviewer · 🧪 Test Engineer · 🛠️ Tool Builder

Vijay Janapa Reddi
🧑‍💻 🎨 ✍️ 🧠 maintenance

Peter Koellner
🪲 ✍️

Rocky
🪲 🧑‍💻

Zeljko Hrcek
🧑‍💻

Recognize a contributor: Comment on any issue or PR:

@all-contributors please add @username for code, doc, ideas, or bug

License

Code: Apache License 2.0 — free for commercial and non-commercial use, with patent grant and attribution requirement.

Documentation and textbook prose: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC-BY-NC-SA-4.0) — the tutorials and prose on mlsysbook.ai/mlsysim are part of the Machine Learning Systems textbook and carry its license.

The two licenses are intentionally separate: the Python package is permissively licensed so engineers and researchers can use it anywhere (including commercially), while the textbook prose retains its non-commercial protection to prevent republication as a derivative textbook.
