Volume 2: Machine Learning Systems at Scale - Seminal Bibliography

This document tracks the foundational research papers, hardware architectures, and industry standards that anchor Volume 2. It is organised for a "textbook-scale" deep dive, targeting 700-800 citations.


Part I: Foundations of Scale (Distributed Logic)

Distributed Training Paradigms

  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (Huang et al., 2019)
  • PipeDream: Generalized Pipeline Parallelism for DNN Training (Narayanan et al., 2019)
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (Shoeybi et al., 2019)
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., 2020) - memory accounting sketched after this list.
  • Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021) - The 3D Parallelism Blueprint.
  • DistBelief: Large Scale Distributed Deep Networks (Dean et al., 2012) - Parameter Server origins.
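
For orientation on the ZeRO entry above, a worked memory estimate using the paper's accounting for mixed-precision Adam (2 bytes each per parameter for fp16 weights and gradients, plus K = 12 bytes of fp32 optimizer state; Ψ is the parameter count and N_d the data-parallel degree). A sketch of the standard arithmetic, not a substitute for the paper:

```latex
% Baseline data parallelism replicates all training state on every GPU:
M_{\text{baseline}} = (2 + 2 + K)\,\Psi = 16\,\Psi \ \text{bytes}
% ZeRO-3 partitions parameters, gradients, and optimizer states across N_d GPUs:
M_{\text{ZeRO-3}} \approx \frac{16\,\Psi}{N_d}
% Example from the paper's regime: 7.5B parameters on 64 GPUs is ~120 GB replicated
% versus ~1.9 GB per GPU when partitioned (activations and buffers are extra).
```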

Collective Communication & Algorithms

  • Bandwidth Optimal All-reduce Algorithms (Patarasuk & Yuan, 2009) - Ring AllReduce proof; bandwidth bound sketched after this list.
  • Synthesizing Optimal Collective Algorithms (SCCL) (Cai et al., 2020)
  • Rethinking ML Collective Communication as a Multi-Commodity Flow Problem (Liu et al., 2024)
  • NCCL: Accelerated Multi-GPU Collective Communications (NVIDIA, 2017-2025)
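
A compact statement of the bandwidth argument behind the Patarasuk & Yuan entry above (N bytes reduced across p workers over links of bandwidth B; the standard ring-allreduce accounting, not the paper's notation):

```latex
% Reduce-scatter: p-1 steps, each moving N/p bytes per link.
% All-gather:     p-1 steps, each moving N/p bytes per link.
T_{\text{ring}} \approx 2\,(p-1)\,\frac{N}{p\,B} \;\longrightarrow\; \frac{2N}{B} \quad (p \to \infty)
% Per-worker traffic 2N(p-1)/p matches the lower bound, hence "bandwidth optimal";
% the price is latency that grows linearly in p.
```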

Fault Tolerance & Resilience

  • Oobleck: Resilient Distributed Training using Pipeline Templates (Jang et al., 2023)
  • Varuna: Scalable, Low-cost Training of Massive Models (Athlur et al., 2022) - Spot Instance resilience.
  • Bamboo: Making Preemptible Instances Resilient for Affordable Training (Thorpe et al., 2023)

Part II: Building the Machine Learning Fleet (Physical Layer)

Compute Infrastructure (Silicon & Systems)

  • In-Datacenter Performance Analysis of a TPU (Jouppi et al., 2017) - TPU v1.
  • The Design Process for Google's Training Chips: TPUv2 and TPUv3 (Norrie et al., 2021)
  • TPU v4: An Optically Reconfigurable Supercomputer (Jouppi et al., 2023) - SparseCores & Optical Switching.
  • Dissecting the NVIDIA Hopper Architecture (Luo et al., 2025) - H100/H200 analysis.
  • Microbenchmarking NVIDIA's Blackwell Architecture (Jarmusch et al., 2025) - B200 analysis.
  • The Path to Successful Wafer-Scale Integration: The Cerebras Story (Lauterbach, 2021)

Network Fabrics (Topologies & Protocols)

  • A Scalable, Commodity Data Center Network Architecture (Fat-Tree) (Al-Fares et al., 2008) - sizing arithmetic sketched after this list.
  • Technology-Driven, Highly-Scalable Dragonfly Topology (Kim et al., 2008)
  • Jellyfish: Networking Data Centers Randomly (Singla et al., 2012)
  • Congestion Control for Large-Scale RDMA Deployments (DCQCN) (Zhu et al., 2015)
  • HPCC: High Precision Congestion Control (Li et al., 2019)
  • Swift: Delay is Simple and Effective for Congestion Control (Kumar et al., 2020) - Google's Swift protocol.
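
As a quick sizing check for the Fat-Tree entry above: a fat-tree built entirely from k-port switches has k pods (each with k/2 edge and k/2 aggregation switches) plus (k/2)^2 core switches, and supports full bisection bandwidth for

```latex
\text{hosts} = k \cdot \frac{k}{2} \cdot \frac{k}{2} = \frac{k^{3}}{4}
% e.g. k = 48 ports gives 27{,}648 hosts, the example used in the paper.
```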

Memory & Interconnect Standards

  • HBM3: Enabling Memory Resilience at Scale (Standardization Papers)
  • Compute Express Link (CXL): A Comprehensive Survey (Lian et al., 2024)
  • Next-Gen Interconnection Systems with CXL (2024)

Part III: Deployment & Optimization (The Serving Layer)

Inference & Serving

  • Efficient Memory Management for LLM Serving with PagedAttention (vLLM) (Kwon et al., 2023) - block-table idea sketched after this list.
  • Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., 2022) - Continuous Batching.
  • FlexFlow: Beyond Data and Model Parallelism for Deep Neural Networks (Jia et al., 2019)
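
To make the PagedAttention entry above concrete, a minimal sketch of the idea it cites: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical to physical blocks, so memory is claimed on demand rather than reserved for the maximum length. Names here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are illustrative, not vLLM's API.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks from a shared pool."""
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted; preempt or swap a sequence")
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

class Sequence:
    """Tracks one request's logical-to-physical block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the last one is full,
        # so at most one partially filled block is wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()

# Usage: two sequences share one physical pool and grow independently.
pool = BlockAllocator(num_physical_blocks=1024)
a, b = Sequence(pool), Sequence(pool)
for _ in range(40):
    a.append_token()
for _ in range(5):
    b.append_token()
print(len(a.block_table), len(b.block_table))  # 3 blocks vs. 1 block
```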

Performance Engineering (Quantization & Compression)

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) - quantization step sketched after this list.
  • SmoothQuant: Accurate and Efficient Post-Training Quantization (Xiao et al., 2023)
  • FlashAttention: Fast and Memory-Efficient Exact Attention (Dao et al., 2022)
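
A toy illustration of the row-wise absmax 8-bit quantization that the LLM.int8() entry above builds on. This sketches only the generic technique; the paper's contribution is the mixed-precision handling of outlier feature dimensions, which is omitted here, and the function names are illustrative.

```python
import numpy as np

def quantize_rowwise_int8(x: np.ndarray):
    """Symmetric absmax quantization with one fp32 scale per row."""
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_rowwise_int8(x)
print(np.max(np.abs(x - dequantize(q, s))))  # small round-trip quantization error
```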

Part IV: The Vanguard (The Future of Scale)

Optical & Photonic Systems

  • Leveraging Optical Chip-to-chip Connectivity (Ayar Labs, 2023)
  • Photonic AI Acceleration: A New Kind of Computer (Lightmatter, 2025)
  • Panel-Scale Reconfigurable Photonic Interconnects (Hsueh et al., 2025)

Core Textbooks (Strategic Guides)

  • Computer Architecture: A Quantitative Approach (Hennessy & Patterson) - The architectural Bible.
  • Designing Machine Learning Systems (Chip Huyen, 2022)
  • Designing Data-Intensive Applications (Martin Kleppmann, 2017)
  • Distributed Systems: Principles and Paradigms (Tanenbaum & van Steen)