mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-07 18:18:42 -05:00
* docs(mlsysim): release-prep audit fixes for 0.1.0
Fixes broken links, stale numerical claims, and naming inconsistencies
surfaced by the 0.1.0 release-prep review. The docs site's output now matches
what the engine actually computes, internal navigation has no unresolved targets,
and the Hatch announcement banner uses an absolute URL so sub-pages render the
"Get started" link correctly.
Notable changes:
- Hero examples on docs/index.qmd and getting-started.qmd now reflect the actual
Engine.solve(ResNet50, A100, bs=1, fp16) output (Memory / 0.54 ms / 1843).
- Update Python version requirement (3.10+) and document the editable-install
limitation (Hatch sources rewrite is not supported by editables).
- Standardize the typographic brand to "MLSys·im" in the navbar, OG/Twitter
metadata, and the shared cross-site dropdown.
- Add the four solvers missing from the quartodoc list
(BatchingOptimizer, ForwardModel, NetworkRooflineModel, PlacementOptimizer)
and surface the orphan tutorials (01_pipeline_callbacks,
02_differential_explainer, 12_design_space_exploration) in the sidebar.
- Update every reference to the now-deleted hello_world / llm_serving /
sustainability / 11_full_stack_audit tutorials to their current filenames.
- Add the missing @mlsysbook2024 entry to references.bib so whitepaper.qmd
no longer logs a citeproc warning.
- Fix the CLI sample on the parent site/index.qmd card to use real model
identifiers (Llama3_70B H100 --batch-size 1).
- Soften the Colab/Binder copy until launch buttons are wired in.
- Remove the duplicate "Differential Explainer" card on tutorials/index.qmd.
* release(mlsysim): add 0.1.0 release notes and runbook
- RELEASE_NOTES_0.1.0.md: GitHub-release-ready notes promoted from CHANGELOG
with install/quickstart copy and a "known limitations & gotchas" section
covering the editable-install issue, broken example scripts, and unpublished
slide tag.
- RELEASE.md: copy-pasteable runbook for cutting a release (pre-flight check,
tag, build, twine upload, docs deploy via workflow_dispatch, GitHub release,
and post-release verification).
- CHANGELOG.md: corrected the test count from 334 to the actual 367 currently
passing on dev.
* mlsysim: nest package layout, enable editable installs, clean lint
Restructure mlsysim into the standard nested layout (`mlsysim/mlsysim/...`)
so `pip install -e .` works out of the box. The previous flat layout used
a Hatch `sources = {"." = "mlsysim"}` prefix-add rewrite that the
`editables` backend cannot handle, breaking editable installs entirely.
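The before/after can be sketched in pyproject.toml terms (a minimal sketch using the option names the commit cites; the full file has more sections):

```toml
[tool.hatch.build.targets.wheel]
# Previously (flat layout) the build used a prefix-add rewrite that the
# `editables` backend cannot replicate at install time:
#   sources = {"." = "mlsysim"}
# With the nested mlsysim/mlsysim/... layout the package ships as-is:
packages = ["mlsysim"]
```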
Packaging
- pyproject.toml: drop `sources` rewrite, set `packages = ["mlsysim"]`,
add explicit `[tool.hatch.build.targets.sdist]` include list.
- Wheel and sdist now contain only the package and project metadata
(no `tests/`, `docs/`, `examples/`, `paper/`, `vscode-ext/` leakage).
- Update `pyright.exclude` for nested layout.
- Update GitHub source links in `docs/math.qmd` and
`docs/models-and-solvers.qmd` to point to `mlsysim/mlsysim/...`.
Lint configuration
- Add `[tool.ruff]` to pyproject.toml with sensible per-file ignores:
`__init__.py` re-export pattern (F401/F403/F405/F811),
`core/constants.py` star import from unit registry,
tests/examples idioms.
- `ruff check .` reports zero issues (down from 621).
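The ignore table has roughly this shape (illustrative only — the exact paths and rule codes for the tests/examples entries follow the repo, and the codes shown for them here are hypothetical):

```toml
[tool.ruff.lint.per-file-ignores]
"mlsysim/mlsysim/__init__.py" = ["F401", "F403", "F405", "F811"]  # re-export pattern
"mlsysim/mlsysim/core/constants.py" = ["F403", "F405"]            # unit-registry star import
"tests/**" = ["S101"]                                             # test idioms (code hypothetical)
```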
Real bug fixes uncovered by lint cleanup
- `core/solver.py`: remove unused `from pydantic import BaseModel` that
was being shadowed by the local `BaseModel = ForwardModel` alias.
- `sim/simulations.py`: remove redundant local `Fleet` import that was
shadowing the module-level import and triggering F823 (referenced
before assignment) on the earlier `isinstance(..., Fleet)` check.
- `cli/commands/audit.py`, `cli/commands/eval.py`: narrow three bare
`except:` clauses to specific exception types.
- `tests/test_sota.py`: add the missing speculative-decoding ITL
assertion (`res_opt.itl < res_base.itl`) — `res_base` was previously
computed but never compared.
- `cli/commands/eval.py`: drop unused `is_json` local.
- `labs/components.py`: drop unused `energy` placeholder local.
Examples
- `examples/06_multi_objective_pareto.py`: rewrite around the actual
`BatchingOptimizerResult` API (which has no `pareto_front` attribute);
build the front explicitly by sweeping batch sizes through
`ServingModel` + `TailLatencyModel`, then highlight the optimum
returned by `BatchingOptimizer`.
- `examples/gemini_design_loop.py`: fix multi-line f-string syntax errors
(`f"\n[…]"` instead of an embedded literal newline) so the file imports
on every supported Python version.
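The explicit Pareto-front construction can be sketched with toy numbers (the (batch_size, throughput, latency) triples below are made up; in the real example ServingModel and TailLatencyModel supply them):

```python
# Toy stand-in for the batch-size sweep: (batch_size, throughput, latency).
points = [
    (1, 100.0, 1.0),
    (4, 350.0, 1.4),
    (8, 500.0, 2.5),
    (16, 480.0, 4.0),  # dominated: lower throughput AND higher latency than bs=8
    (32, 900.0, 9.0),
]


def pareto_front(pts):
    # Keep a point unless some other point is at least as good on both axes
    # (higher-or-equal throughput, lower-or-equal latency).
    return [
        p for p in pts
        if not any(q is not p and q[1] >= p[1] and q[2] <= p[2] for q in pts)
    ]


print([bs for bs, _, _ in pareto_front(points)])
```

With these numbers the bs=16 point drops out and the front is [1, 4, 8, 32].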
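The f-string fix is the standard one; a minimal reproduction (variable names hypothetical):

```python
# A literal newline in the text of a single-quoted f-string is a SyntaxError:
#
#   msg = f"
#   [{role.upper()} AGENT] Thinking..."
#
# The portable form keeps the line break as an escape inside the literal:
role = "architect"
msg = f"\n[{role.upper()} AGENT] Thinking..."
print(repr(msg))
```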
Dev scripts
- `generate_appendix.py` and `paper/scripts/validate_anchors.py`: switch
from package-relative imports to absolute `from mlsysim... import` so
they run cleanly under the nested layout.
Docs / release notes
- `docs/getting-started.qmd`: replace the editable-install caveat with
`pip install -e ".[dev]"` (now supported).
- `RELEASE_NOTES_0.1.0.md`: drop the three "known limitations" entries
that this commit resolves (editable install, pareto example, gemini
example).
- `CHANGELOG.md`: add a "Packaging & Tooling" section describing the
layout change and the resolver bug fixes.
Verification
- `python -m pytest tests/` → 367 passed (was 367, no regressions).
- `ruff check .` → All checks passed.
- `pip install -e .` → succeeds; live source picked up.
- Fresh-venv wheel install + CLI smoke test → succeeds.
- `examples/06_multi_objective_pareto.py` and
`examples/gemini_design_loop.py` → both exit 0.
* fix(mlsysim): repair docs build + lab test after nested-package restructure
The 0.1.0 release prep moved the package from `mlsysim/` to `mlsysim/mlsysim/`
to support `pip install -e .`. Two CI jobs still depended on the old layout:
1. **Docs build (`mlsysim-preview-dev`)** — every tutorial and zoo page used
a hand-rolled `importlib.util.spec_from_file_location` block to load
`<repo>/mlsysim/__init__.py` directly from source. After the restructure,
that path no longer exists. Replaced the hack in 17 docs `.qmd` files with

a plain `import mlsysim` — the package is already pip-installed in the
docs build environment via `pip install ".[docs]"`. Updated the matching
guidance in `contributing.qmd`.
2. **Lab static tests** — `test_no_localstorage_import` hard-coded
`mlsysim/labs/state.py`; updated to the new nested path
`mlsysim/mlsysim/labs/state.py`.
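The removed loader pattern looks roughly like this (demonstrated against a throwaway module so the sketch is runnable; the real blocks pointed at `<repo>/mlsysim/__init__.py`):

```python
import importlib.util
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = pathlib.Path(d) / "demo_mod.py"
    src.write_text("VALUE = 42\n")

    # Hand-rolled load-from-file-path, the hack the .qmd pages carried:
    spec = importlib.util.spec_from_file_location("demo_mod", str(src))
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    print(mod.VALUE)

# With the package pip-installed in the docs build environment, all of the
# above collapses to a plain `import mlsysim`.
```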
Verified locally: `pytest labs/tests/test_static.py::TestStateImplementation`
passes, and `quarto render docs/zoo/models.qmd` succeeds end-to-end.
135 lines
4.7 KiB
Python
"""
|
|
Agentic Infrastructure Design Loop (Conceptual Implementation)
|
|
============================================================
|
|
Vision: "AI designing AI infrastructure."
|
|
|
|
This script demonstrates how an advanced multi-agent system (e.g., powered
|
|
by multiple Gemini-capable models like gemini-3-pro-preview) can use MLSys·im
|
|
to autonomously design, debate, and refine a datacenter cluster.
|
|
|
|
We simulate two agents:
|
|
1. The "Architect": Generates cluster configurations (YAML) to meet an SLA.
|
|
2. The "Critic/Evaluator": Runs MLSys·im, reads the physics output, and points out
|
|
bottlenecks (e.g., "We hit the memory wall here, increase batch size or nodes.")
|
|
|
|
This is the exact loop that makes MLSys·im the de facto standard: it's not just a
|
|
calculator for humans; it's a physics engine for autonomous AI engineers.
|
|
"""
|
|
|
|
import os
|
|
import yaml
|
|
import json
|
|
import time
|
|
|
|
# In a real environment, this would be google.generativeai or similar.
|
|
# import google.generativeai as genai
|
|
|
|
# We mock the LLM responses for the sake of the reproducible example in the repo.
|
|
class MockGeminiAgent:
|
|
def __init__(self, role: str):
|
|
self.role = role
|
|
self.history = []
|
|
|
|
def prompt(self, text: str, tools=None) -> str:
|
|
"""Simulates calling a frontier Gemini model."""
|
|
print(f"\n[{self.role.upper()} AGENT] Thinking...")
|
|
time.sleep(1)
|
|
|
|
if "Initial Request" in text:
|
|
return """
|
|
version: "1.0"
|
|
name: "Llama3 70B First Attempt"
|
|
workload:
|
|
name: "Llama3_70B"
|
|
batch_size: 256
|
|
hardware:
|
|
name: "H100"
|
|
nodes: 1
|
|
ops:
|
|
region: "US_Avg"
|
|
duration_days: 30.0
|
|
"""
|
|
elif "FAIL" in text and "OOM" in text:
|
|
print(f"[{self.role.upper()} AGENT] Noticed Memory Wall failure. Adjusting parallel nodes.")
|
|
return """
|
|
version: "1.0"
|
|
name: "Llama3 70B Distributed Attempt"
|
|
workload:
|
|
name: "Llama3_70B"
|
|
batch_size: 256
|
|
hardware:
|
|
name: "H100"
|
|
nodes: 8
|
|
ops:
|
|
region: "Quebec"
|
|
duration_days: 30.0
|
|
"""
|
|
return "Task Complete."
|
|
|
|
|
|
def run_agentic_loop():
|
|
from mlsysim.cli.schemas import MlsysPlanSchema
|
|
from mlsysim.core.evaluation import SystemEvaluator
|
|
|
|
print("==================================================")
|
|
print("🚀 INITIALIZING MLSYS·IM AGENTIC DESIGN LOOP")
|
|
print("==================================================")
|
|
|
|
architect = MockGeminiAgent(role="Architect")
|
|
|
|
# The Goal SLA
|
|
goal = "Design a cluster to serve Llama3_70B. Keep it under 10 nodes if possible. Minimize carbon."
|
|
print(f"\n[USER] Goal: {goal}")
|
|
|
|
iteration = 1
|
|
max_iterations = 3
|
|
current_prompt = f"Initial Request: {goal}. Output ONLY the YAML."
|
|
|
|
while iteration <= max_iterations:
|
|
print(f"\n--- Iteration {iteration} ---")
|
|
|
|
# 1. Agent generates YAML
|
|
yaml_str = architect.prompt(current_prompt).strip()
|
|
print("Proposed Architecture YAML:")
|
|
print(yaml_str)
|
|
|
|
# 2. Execute against MLSys·im Physics Engine
|
|
raw_data = yaml.safe_load(yaml_str)
|
|
try:
|
|
schema = MlsysPlanSchema(**raw_data)
|
|
eval_obj = SystemEvaluator.evaluate(
|
|
scenario_name=schema.name,
|
|
model_obj=schema.model_obj,
|
|
hardware_obj=schema.hardware_obj,
|
|
batch_size=schema.workload.batch_size,
|
|
precision=schema.hardware.precision,
|
|
efficiency=schema.hardware.efficiency,
|
|
fleet_obj=schema.fleet_obj,
|
|
nodes=schema.hardware.nodes,
|
|
duration_days=schema.ops.duration_days
|
|
)
|
|
|
|
result_dict = eval_obj.to_dict()
|
|
|
|
# 3. Analyze output (The Critic)
|
|
if result_dict["f_status"] == "FAIL":
|
|
feedback = f"Feasibility FAIL. Summary: {eval_obj.feasibility.summary}. Please fix the OOM issue."
|
|
print(f"[ENVIRONMENT] ❌ {feedback}")
|
|
current_prompt = f"Previous YAML failed: {feedback}. Output a new corrected YAML."
|
|
else:
|
|
print("[ENVIRONMENT] ✅ Design is physically feasible.")
|
|
print(f" Throughput: {result_dict.get('p_throughput', 'N/A')}")
|
|
print(f" TCO ($): ${result_dict.get('m_tco_usd', 0):,.2f}")
|
|
print(f" Carbon: {result_dict.get('m_carbon_footprint', 0):.2f} tonnes")
|
|
print("\n[SUCCESS] Agent reached optimal configuration.")
|
|
break
|
|
|
|
except Exception as e:
|
|
print(f"[ENVIRONMENT] ❌ Crash evaluating YAML: {e}")
|
|
current_prompt = f"YAML parsing or execution failed with error: {e}. Fix the schema."
|
|
|
|
iteration += 1
|
|
|
|
if __name__ == "__main__":
|
|
run_agentic_loop()
|