Volume 2: Machine Learning Systems at Scale - Seminal Bibliography

This document tracks the foundational research papers, hardware architectures, and industry standards that anchor Volume 2. It is organised for a "textbook-scale" deep dive, targeting 700-800 citations.


Part I: Foundations of Scale (Distributed Logic)

Distributed Training Paradigms

  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (Huang et al., 2019)
  • PipeDream: Generalized Pipeline Parallelism for DNN Training (Narayanan et al., 2019)
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (Shoeybi et al., 2019)
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., 2020) - memory accounting sketched after this list.
  • Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021) - The 3D Parallelism Blueprint.
  • DistBelief: Large Scale Distributed Deep Networks (Dean et al., 2012) - Parameter Server origins.
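
For orientation on the ZeRO entry above, a worked memory estimate using the paper's accounting for mixed-precision Adam (2 bytes each per parameter for fp16 weights and gradients, plus K = 12 bytes of fp32 optimizer state; Ψ is the parameter count and N_d the data-parallel degree). A sketch of the standard arithmetic, not a substitute for the paper:

```latex
% Baseline data parallelism replicates all training state on every GPU:
M_{\text{baseline}} = (2 + 2 + K)\,\Psi = 16\,\Psi \ \text{bytes}
% ZeRO-3 partitions parameters, gradients, and optimizer states across N_d GPUs:
M_{\text{ZeRO-3}} \approx \frac{16\,\Psi}{N_d}
% Example from the paper's regime: 7.5B parameters on 64 GPUs is ~120 GB replicated
% versus ~1.9 GB per GPU when partitioned (activations and buffers are extra).
```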

Collective Communication & Algorithms

  • Bandwidth Optimal All-reduce Algorithms (Patarasuk & Yuan, 2009) - Ring AllReduce proof; bandwidth bound sketched after this list.
  • Synthesizing Optimal Collective Algorithms (SCCL) (Cai et al., 2020)
  • Rethinking ML Collective Communication as a Multi-Commodity Flow Problem (Liu et al., 2024)
  • NCCL: Accelerated Multi-GPU Collective Communications (NVIDIA, 2017-2025)
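
A compact statement of the bandwidth argument behind the Patarasuk & Yuan entry above (N bytes reduced across p workers over links of bandwidth B; the standard ring-allreduce accounting, not the paper's notation):

```latex
% Reduce-scatter: p-1 steps, each moving N/p bytes per link.
% All-gather:     p-1 steps, each moving N/p bytes per link.
T_{\text{ring}} \approx 2\,(p-1)\,\frac{N}{p\,B} \;\longrightarrow\; \frac{2N}{B} \quad (p \to \infty)
% Per-worker traffic 2N(p-1)/p matches the lower bound, hence "bandwidth optimal";
% the price is latency that grows linearly in p.
```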

Fault Tolerance & Resilience

  • Oobleck: Resilient Distributed Training using Pipeline Templates (Jang et al., 2023)
  • Varuna: Scalable, Low-cost Training of Massive Models (Athlur et al., 2022) - Spot Instance resilience.
  • Bamboo: Making Preemptible Instances Resilient for Affordable Training (Thorpe et al., 2023)

Part II: Building the Machine Learning Fleet (Physical Layer)

Compute Infrastructure (Silicon & Systems)

  • In-Datacenter Performance Analysis of a TPU (Jouppi et al., 2017) - TPU v1.
  • The Design Process for Google's Training Chips: TPUv2 and TPUv3 (Norrie et al., 2021)
  • TPU v4: An Optically Reconfigurable Supercomputer (Jouppi et al., 2023) - SparseCores & Optical Switching.
  • Dissecting the NVIDIA Hopper Architecture (Luo et al., 2025) - H100/H200 analysis.
  • Microbenchmarking NVIDIA's Blackwell Architecture (Jarmusch et al., 2025) - B200 analysis.
  • The Path to Successful Wafer-Scale Integration: The Cerebras Story (Lauterbach, 2021)

Network Fabrics (Topologies & Protocols)

  • A Scalable, Commodity Data Center Network Architecture (Fat-Tree) (Al-Fares et al., 2008) - sizing arithmetic sketched after this list.
  • Technology-Driven, Highly-Scalable Dragonfly Topology (Kim et al., 2008)
  • Jellyfish: Networking Data Centers Randomly (Singla et al., 2012)
  • Congestion Control for Large-Scale RDMA Deployments (DCQCN) (Zhu et al., 2015)
  • HPCC: High Precision Congestion Control (Li et al., 2019)
  • Swift: Delay is Simple and Effective for Congestion Control (Kumar et al., 2020) - Google's Swift protocol.
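
As a quick sizing check for the Fat-Tree entry above: a fat-tree built entirely from k-port switches has k pods (each with k/2 edge and k/2 aggregation switches) plus (k/2)^2 core switches, and supports full bisection bandwidth for

```latex
\text{hosts} = k \cdot \frac{k}{2} \cdot \frac{k}{2} = \frac{k^{3}}{4}
% e.g. k = 48 ports gives 27{,}648 hosts, the example used in the paper.
```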

Memory & Interconnect Standards

  • HBM3: Enabling Memory Resilience at Scale (Standardization Papers)
  • Compute Express Link (CXL): A Comprehensive Survey (Lian et al., 2024)
  • Next-Gen Interconnection Systems with CXL (2024)

Part III: Deployment & Optimization (The Serving Layer)

Inference & Serving

  • Efficient Memory Management for LLM Serving with PagedAttention (vLLM) (Kwon et al., 2023) - block-table idea sketched after this list.
  • Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., 2022) - Continuous Batching.
  • FlexFlow: Beyond Data and Model Parallelism for Deep Neural Networks (Jia et al., 2019)
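
To make the PagedAttention entry above concrete, a minimal sketch of the idea it cites: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical to physical blocks, so memory is claimed on demand rather than reserved for the maximum length. Names here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are illustrative, not vLLM's API.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks from a shared pool."""
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted; preempt or swap a sequence")
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

class Sequence:
    """Tracks one request's logical-to-physical block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the last one is full,
        # so at most one partially filled block is wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()

# Usage: two sequences share one physical pool and grow independently.
pool = BlockAllocator(num_physical_blocks=1024)
a, b = Sequence(pool), Sequence(pool)
for _ in range(40):
    a.append_token()
for _ in range(5):
    b.append_token()
print(len(a.block_table), len(b.block_table))  # 3 blocks vs. 1 block
```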

Performance Engineering (Quantization & Compression)

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) - quantization step sketched after this list.
  • SmoothQuant: Accurate and Efficient Post-Training Quantization (Xiao et al., 2023)
  • FlashAttention: Fast and Memory-Efficient Exact Attention (Dao et al., 2022)
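
A toy illustration of the row-wise absmax 8-bit quantization that the LLM.int8() entry above builds on. This sketches only the generic technique; the paper's contribution is the mixed-precision handling of outlier feature dimensions, which is omitted here, and the function names are illustrative.

```python
import numpy as np

def quantize_rowwise_int8(x: np.ndarray):
    """Symmetric absmax quantization with one fp32 scale per row."""
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_rowwise_int8(x)
print(np.max(np.abs(x - dequantize(q, s))))  # small round-trip quantization error
```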

Part IV: The Vanguard (The Future of Scale)

Optical & Photonic Systems

  • Leveraging Optical Chip-to-chip Connectivity (Ayar Labs, 2023)
  • Photonic AI Acceleration: A New Kind of Computer (Lightmatter, 2025)
  • Panel-Scale Reconfigurable Photonic Interconnects (Hsueh et al., 2025)

Core Textbooks (Strategic Guides)

  • Computer Architecture: A Quantitative Approach (Hennessy & Patterson) - The architectural Bible.
  • Designing Machine Learning Systems (Chip Huyen, 2022)
  • Designing Data-Intensive Applications (Martin Kleppmann, 2017)
  • Distributed Systems: Principles and Paradigms (Tanenbaum & van Steen)