Files
cs249r_book/book/docs/VOLUME_STRUCTURE.md
Vijay Janapa Reddi 09602445de chore: update book content, config, appendices, and tooling
- Vol1: chapter updates across backmatter, benchmarking, data, frameworks, etc.
- Vol2: content updates, new appendices (assumptions, communication, fleet, reliability)
- Quarto: config, styles, formulas, constants
- Add SEMINAL_PAPERS_V2.md, learning_objectives_bolding_parallel.sh
- VSCode extension: package.json, chapterNavigatorProvider
- Landing page and docs updates
2026-02-20 18:55:24 -05:00

6.2 KiB

Machine Learning Systems: Two-Volume Structure

Status: Implemented Target Publisher: MIT Press Audience: Undergraduate and graduate CS/ECE students, academic courses


Overview

This textbook is organized into two volumes following the Hennessy & Patterson pedagogical model:

  • Volume I: Introduction to Machine Learning Systems — Build, Optimize, Deploy
  • Volume II: Machine Learning Systems at Scale — Scale, Distribute, Govern

Each volume stands alone as a complete learning experience while together forming a comprehensive treatment of the field.


Volume I: Introduction to Machine Learning Systems

Goal

A reader completes Volume I and can competently build, optimize, and deploy ML systems on a single machine with awareness of responsible practices.

Target Audience

  • Upper-level undergraduates
  • Early graduate students
  • Practitioners transitioning into ML systems

Course Mapping

  • Single semester "Introduction to Machine Learning Systems" course
  • Foundation for more advanced distributed systems or MLOps courses

Structure (16 chapters)

Part I: Foundations

Establish the conceptual framework for understanding ML as a systems discipline.

Ch Title Purpose
1 Introduction Why ML systems thinking matters
2 ML Systems Survey of the field, deployment paradigms
3 ML Workflow End-to-end ML development process
4 Data Engineering Pipelines, preprocessing, data quality

Part II: Build

The technical implementation of machine learning systems from math to trained models.

Ch Title Purpose
5 Neural Computation Mathematical and conceptual foundations
6 Network Architectures CNNs, RNNs, Transformers, architectural choices
7 ML Frameworks PyTorch, TensorFlow, JAX ecosystem
8 Model Training Training loops, optimization, debugging

Part III: Optimization

Techniques for making ML systems efficient and fast.

Ch Title Purpose
9 Data Selection Optimizing information, active learning, pruning
10 Model Compression Quantization, pruning, distillation
11 Hardware Acceleration GPUs, TPUs, custom accelerators
12 Benchmarking Measuring performance, MLPerf

Part IV: Deployment

Getting models into production responsibly.

Ch Title Purpose
13 Model Serving Inference fundamentals, batching, latency optimization
14 ML Operations Deployment, monitoring, CI/CD for ML
15 Responsible Engineering Ethics, safety, and professional practice
16 Conclusion Synthesis and bridge to Volume II

Volume II: Machine Learning Systems at Scale

Goal

A reader completes Volume II understanding how to build and operate ML systems at scale, with production resilience and responsible practices.

Target Audience

  • Graduate students
  • Industry practitioners
  • Researchers building large-scale systems

Prerequisites

  • Volume I or equivalent knowledge
  • Basic distributed systems concepts helpful

Course Mapping

  • Graduate seminar on large-scale ML systems
  • Advanced MLOps course
  • Research group reading material

Structure (16 chapters)

Part I: Foundations of Scale

Infrastructure and concepts for scaling beyond single machines.

Ch Title Purpose
1 Introduction Motivation, challenges of scale
2 Infrastructure Clusters, cloud, resource management
3 Storage Systems Data lakes, distributed storage, checkpointing
4 Communication AllReduce, parameter servers, network topology

Part II: Distributed Systems

Training and inference across multiple machines.

Ch Title Purpose
5 Distributed Training Parallelism strategies, multi-chip hardware, scaling infrastructure
6 Fault Tolerance Checkpointing, recovery, handling failures
7 Inference at Scale Serving systems, batching, latency optimization
8 Edge Intelligence Federated learning, fleet coordination, on-device adaptation

Part III: Production Challenges

Real-world complexities of operating ML systems.

Ch Title Purpose
9 Privacy & Security Differential privacy, secure computation, attacks
10 Robust AI Adversarial robustness, distribution shift
11 ML Ops at Scale Advanced MLOps, platform engineering
12 Sustainable AI Environmental impact, efficient computing

Part IV: Responsible Deployment

Building ML systems that benefit society.

Ch Title Purpose
13 Responsible AI Fairness, accountability, transparency
14 AI for Good Applications for societal benefit
15 Frontiers Emerging trends, open problems
16 Conclusion Synthesis, future of the field

Key Design Decisions

Why This Split?

  1. Pedagogical Progression: Volume I covers what every ML practitioner needs. Volume II covers what scale/production engineers need.

  2. Course Adoptability: Volume I maps to a single semester intro course. Volume II maps to an advanced graduate seminar.

  3. Standalone Completeness: A reader of only Volume I gets responsible engineering awareness through Chapter 14.

  4. Industry Alignment: Volume I produces capable junior engineers. Volume II produces senior/staff-level systems thinkers.

The Hennessy & Patterson Test

When deciding where content belongs, ask: What is the SCOPE of the system being discussed?

Aspect Volume I Volume II
Scope Single-machine systems (1-8 GPUs) Multi-machine distributed systems
Math & Theory Full rigor, derivations Full rigor, derivations
Performance Metrics Single-system analysis Scaling/efficiency analysis
Code Examples Single-node implementations Multi-node implementations

Summary Statistics

Metric Volume I Volume II
Chapters 16 16
Parts 4 4
Focus Single system Distributed systems
Prerequisite None Volume I

Document Version: January 2025 Reflects current implementation in _quarto-html.yml