---
format:
  html:
    title: "Machine Learning Systems at Scale"
    date: today
    date-format: long
    doi: "v0.5.1"
    doi-title: "Version"
    author:
      name: Vijay Janapa Reddi
      email: vj@eecs.harvard.edu
      url: https://vijay.seas.harvard.edu
      affiliation: Harvard University
---

::: {.content-visible unless-format="html:js"}

# Author's Note {.unnumbered}

::: {style="font-style: italic;"}

When a new model is announced, the world pays attention to the model. The headlines celebrate the benchmark, the parameter count, the capability. What they do not celebrate is the engineering that made it possible. Behind every frontier model is a fleet: tens of thousands of accelerators coordinated across a network fabric where a single misconfigured switch can stall the entire training run. Cooling systems that dissipate megawatts of heat from racks so dense that air alone cannot carry it away. Checkpoint protocols that race against a clock where, in a cluster of ten thousand devices, a hardware failure arrives roughly every two hours. Scheduling systems that must keep utilization high across a machine the size of a small campus while serving workloads with sharply different resource profiles.

This engineering has no precedent. Humanity has built large distributed systems before: the telephone network, the internet, the global financial infrastructure. Yet none of them required the sustained, synchronized, high-bandwidth coordination that training a single model across thousands of accelerators demands. An AllReduce operation that synchronizes gradients across a cluster must complete in milliseconds, not seconds; a straggler that falls behind by even a fraction of that time stalls every other node waiting at the barrier. The tolerances are tighter, the data volumes larger, and the failure modes more subtle than anything the field of distributed systems has confronted at this scale.

This engineering is largely invisible. It is hidden behind APIs, behind cloud abstractions, behind announcements that describe what a model can do but not what it took to build it. The people who design the network topologies, who write the collective communication libraries, who architect the fault tolerance that keeps a ninety-day training run from collapsing on day forty-seven: their work is indispensable but rarely discussed. The model gets the paper. The fleet gets a footnote, if that.

This book is an argument that the fleet deserves more than a footnote. The engineering that makes large-scale ML possible is not a supporting function beneath a more visible discipline. It *is* the discipline. A model that cannot be trained is an idea, not a system. A model that cannot be served is a research artifact, not a product. A model that cannot be governed is a liability, not an asset. At every stage of the lifecycle, from training through serving, operating, and governing, the fleet determines what is possible and what is practical. The principles that govern fleet-scale systems are as deep, as quantitative, and as worthy of study as the algorithms they support.

This book makes that engineering visible. It gives it vocabulary, principles, and the quantitative rigor it deserves. If the companion volume asked "what does it take to build an ML system?", this volume asks "what does it take to build a *thousand* of them, make them work together, and govern them responsibly?" That question is, I believe, the defining engineering challenge of this generation. It deserves a discipline. This book is a step toward building one.

--- Vijay Janapa Reddi

:::

:::

::: {.content-visible when-format="html:js"}

# Welcome {.unnumbered}

```{=html}
<div class="abstract-section">
  <div class="abstract-content">
    <p>Modern machine learning operates at scales that fundamentally change engineering requirements—models too large for single GPUs, services spanning continents, deployments carrying societal responsibilities. This book addresses AI engineering at scale. The treatment follows the lifecycle of a massive-scale system: defining the distributed architecture, building the physical infrastructure fleet, ensuring operational reliability, deploying to global users, and hardening the system for safety and responsibility.</p>
  </div>

  <a href="assets/downloads/Machine-Learning-Systems-Vol2.pdf" target="_blank" class="book-card-link" title="Download PDF">
    <div class="book-card">
      <img src="../../assets/images/covers/cover-hardcover-book.png" alt="Machine Learning Systems Book Cover" class="book-image" />
      <p class="book-title">Machine Learning Systems at Scale</p>
      <p class="book-subtitle">Publisher: The MIT Press (2026)</p>
      <p style="font-size: 0.8em; color: #6c757d; margin-top: 6px; margin-bottom: 0;">📖 Click here to download PDF</p>
    </div>
  </a>
</div>
```

## What You Will Learn {.unnumbered}

This book extends the foundations into production-scale systems through four parts that follow the **Fleet Stack** from bottom to top:

- **Part I: The Fleet** — Build the physical computer. Architect the datacenter infrastructure, high-bandwidth network fabrics, and scalable data storage that form the foundation of every distributed ML deployment.
- **Part II: Distributed ML** — Master the algorithms of scale. Learn how to coordinate computation across thousands of devices using parallelism strategies, collective communication primitives, fault tolerance mechanisms, and fleet orchestration.
- **Part III: Deployment at Scale** — Serve the world. Navigate the shift from training to inference, optimize performance across the serving stack, push intelligence to the edge, and manage the operational lifecycle of production fleets.
- **Part IV: The Responsible Fleet** — Harden and govern the system. Address security, robustness, environmental sustainability, and responsible engineering in large-scale operations.

## Prerequisites {.unnumbered}

This book assumes:

- **Foundational background** in single-machine ML systems, from the companion volume or equivalent
- **Programming proficiency** in Python, including familiarity with NumPy
- **Mathematics foundations** in linear algebra, calculus, and probability
- **Distributed systems concepts** (networking, parallelism): familiarity is helpful for advanced topics

## Support Our Mission {.unnumbered}

```{=html}
<div class="support-mission">
  <p><strong>2026 Goal:</strong> Help 100,000 students learn ML Systems. Sponsors like the <a href="https://edgeaifoundation.org/" target="_blank" rel="noopener noreferrer">EDGE AI Foundation</a> match every star with funding that supports learning.</p>

  <div class="support-actions">
    <span class="star-count" id="star-count">Loading...</span>
    <a href="https://github.com/harvard-edge/cs249r_book" target="_blank" rel="noopener" class="github-star-btn">⭐ Star on GitHub</a>
  </div>

  <p class="support-note">
    <a href="https://opencollective.com/mlsysbook" target="_blank" rel="noopener">Support us on Open Collective →</a>
  </p>
</div>
```

```{=html}
<script>
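// Fetch the repository's star count from the public GitHub REST API
// and display it in the #star-count element.
async function fetchGitHubStars() {
  const starElement = document.getElementById('star-count');
  if (!starElement) return; // Nothing to update if the widget is absent

  try {
    const response = await fetch('https://api.github.com/repos/harvard-edge/cs249r_book');
    // fetch() also resolves on HTTP errors (e.g. rate limiting), so check explicitly
    if (!response.ok) {
      throw new Error(`GitHub API responded with status ${response.status}`);
    }
    const data = await response.json();
    const starCount = data.stargazers_count;
    const formattedCount = starCount.toLocaleString();
    starElement.textContent = formattedCount;
    starElement.style.opacity = '1';
  } catch (error) {
    // On failure, log the error and keep the placeholder text visible
    console.error('Failed to fetch GitHub stars:', error);
    starElement.textContent = 'Loading...';
    starElement.style.opacity = '1';
  }
}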

document.addEventListener('DOMContentLoaded', fetchGitHubStars);
</script>
```

## Listen to the AI Podcast {.unnumbered}

```{=html}
<div class="podcast-section">
  <p>
    This short podcast, created with Google's NotebookLM and inspired by insights from our <a href="https://web.eng.fiu.edu/gaquan/Papers/ESWEEK24Papers/CPS-Proceedings/pdfs/CODES-ISSS/563900a043/563900a043.pdf" target="_blank" rel="noopener">IEEE education viewpoint paper</a>, offers an accessible overview of the book's key ideas and themes.
  </p>
  <audio controls="controls">
    <source src="../../assets/media/notebooklm_podcast_mlsysbookai.mp3" type="audio/mpeg" />
    Your browser does not support the audio element.
  </audio>
</div>
```

## Want to Help Out? {.unnumbered}

This is a collaborative project, and your input matters. If you'd like to contribute, check out our [contribution guidelines](https://github.com/harvard-edge/cs249r_book/blob/main/book/docs/CONTRIBUTING.md). Feedback, corrections, and new ideas are welcome: to share them, file a GitHub [issue](https://github.com/harvard-edge/cs249r_book/issues).

:::