# Machine Learning Systems: Two-Volume Structure

**Status**: Implemented
**Target Publisher**: MIT Press
**Audience**: Undergraduate and graduate CS/ECE students, academic courses

---

## Overview

This textbook is organized into two volumes following the Hennessy & Patterson pedagogical model:

- **Volume I: Introduction to Machine Learning Systems**: Build, Optimize, Deploy
- **Volume II: Machine Learning Systems at Scale**: Scale, Distribute, Govern

Each volume stands alone as a complete learning experience while together forming a comprehensive treatment of the field.

---

## Volume I: Introduction to Machine Learning Systems

### Goal
A reader who completes Volume I can competently build, optimize, and deploy ML systems on a single machine, with awareness of responsible practices.

### Target Audience
- Upper-level undergraduates
- Early graduate students
- Practitioners transitioning into ML systems

### Course Mapping
- Single-semester "Introduction to Machine Learning Systems" course
- Foundation for more advanced distributed systems or MLOps courses

### Structure (16 chapters)

#### Part I: Foundations
Establish the conceptual framework for understanding ML as a systems discipline.

| Ch | Title | Purpose |
|----|-------|---------|
| 1 | Introduction | Why ML systems thinking matters |
| 2 | ML Systems | Survey of the field, deployment paradigms |
| 3 | ML Workflow | End-to-end ML development process |
| 4 | Data Engineering | Pipelines, preprocessing, data quality |

#### Part II: Build
The technical implementation of machine learning systems from math to trained models.

| Ch | Title | Purpose |
|----|-------|---------|
| 5 | Neural Computation | Mathematical and conceptual foundations |
| 6 | Network Architectures | CNNs, RNNs, Transformers, architectural choices |
| 7 | ML Frameworks | PyTorch, TensorFlow, JAX ecosystem |
| 8 | Model Training | Training loops, optimization, debugging |

#### Part III: Optimization
Techniques for making ML systems efficient and fast.

| Ch | Title | Purpose |
|----|-------|---------|
| 9 | Data Selection | Optimizing information, active learning, pruning |
| 10 | Model Compression | Quantization, pruning, distillation |
| 11 | Hardware Acceleration | GPUs, TPUs, custom accelerators |
| 12 | Benchmarking | Measuring performance, MLPerf |

#### Part IV: Deployment
Getting models into production responsibly.

| Ch | Title | Purpose |
|----|-------|---------|
| 13 | Model Serving | Inference fundamentals, batching, latency optimization |
| 14 | ML Operations | Deployment, monitoring, CI/CD for ML |
| 15 | Responsible Engineering | Ethics, safety, and professional practice |
| 16 | Conclusion | Synthesis and bridge to Volume II |

---

## Volume II: Machine Learning Systems at Scale

### Goal
A reader who completes Volume II understands how to build and operate ML systems at scale, with production resilience and responsible practices.

### Target Audience
- Graduate students
- Industry practitioners
- Researchers building large-scale systems

### Prerequisites
- Volume I or equivalent knowledge
- Familiarity with basic distributed systems concepts is helpful

### Course Mapping
- Graduate seminar on large-scale ML systems
- Advanced MLOps course
- Research group reading material

### Structure (16 chapters)

#### Part I: Foundations of Scale
Infrastructure and concepts for scaling beyond single machines.

| Ch | Title | Purpose |
|----|-------|---------|
| 1 | Introduction | Motivation, challenges of scale |
| 2 | Infrastructure | Clusters, cloud, resource management |
| 3 | Storage Systems | Data lakes, distributed storage, checkpointing |
| 4 | Communication | AllReduce, parameter servers, network topology |

#### Part II: Distributed Systems
Training and inference across multiple machines.

| Ch | Title | Purpose |
|----|-------|---------|
| 5 | Distributed Training | Parallelism strategies, multi-chip hardware, scaling infrastructure |
| 6 | Fault Tolerance | Checkpointing, recovery, handling failures |
| 7 | Inference at Scale | Serving systems, batching, latency optimization |
| 8 | Edge Intelligence | Federated learning, fleet coordination, on-device adaptation |

#### Part III: Production Challenges
Real-world complexities of operating ML systems.

| Ch | Title | Purpose |
|----|-------|---------|
| 9 | Privacy & Security | Differential privacy, secure computation, attacks |
| 10 | Robust AI | Adversarial robustness, distribution shift |
| 11 | ML Ops at Scale | Advanced MLOps, platform engineering |
| 12 | Sustainable AI | Environmental impact, efficient computing |

#### Part IV: Responsible Deployment
Building ML systems that benefit society.

| Ch | Title | Purpose |
|----|-------|---------|
| 13 | Responsible AI | Fairness, accountability, transparency |
| 14 | AI for Good | Applications for societal benefit |
| 15 | Frontiers | Emerging trends, open problems |
| 16 | Conclusion | Synthesis, future of the field |

---

## Key Design Decisions

### Why This Split?

1. **Pedagogical Progression**: Volume I covers what every ML practitioner needs. Volume II covers what scale/production engineers need.

2. **Course Adoptability**: Volume I maps to a single-semester introductory course. Volume II maps to an advanced graduate seminar.

3. **Standalone Completeness**: A reader of only Volume I still gains responsible engineering awareness through Chapter 15 (Responsible Engineering).

4. **Industry Alignment**: Volume I produces capable junior engineers. Volume II produces senior/staff-level systems thinkers.

### The Hennessy & Patterson Test

When deciding where content belongs, ask: **What is the SCOPE of the system being discussed?**

| Aspect | Volume I | Volume II |
|--------|----------|-----------|
| **Scope** | Single-machine systems (1-8 GPUs) | Multi-machine distributed systems |
| **Math & Theory** | Full rigor, derivations | Full rigor, derivations |
| **Performance Metrics** | Single-system analysis | Scaling/efficiency analysis |
| **Code Examples** | Single-node implementations | Multi-node implementations |

---

## Summary Statistics

| Metric | Volume I | Volume II |
|--------|----------|-----------|
| Chapters | 16 | 16 |
| Parts | 4 | 4 |
| Focus | Single system | Distributed systems |
| Prerequisite | None | Volume I |

---

*Document Version: January 2025*
*Reflects current implementation in `_quarto-html.yml`*