Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-30 09:38:38 -05:00)
# Volume 2: Machine Learning Systems at Scale - Seminal Bibliography

This document tracks the foundational research papers, hardware architectures, and industry standards that anchor Volume 2, organized for a "textbook-scale" deep dive (targeting 700-800 citations).

---

## Part I: Foundations of Scale (Distributed Logic)
### Distributed Training Paradigms

* **GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism** (Huang et al., 2019)
* **PipeDream: Generalized Pipeline Parallelism for DNN Training** (Narayanan et al., 2019)
* **Megatron-LM: Training Multi-Billion Parameter Models Using Model Parallelism** (Shoeybi et al., 2019)
* **ZeRO: Memory Optimizations Toward Training Trillion Parameter Models** (Rajbhandari et al., 2020)
* **Efficient Large-Scale Language Model Training on GPU Clusters** (Narayanan et al., 2021) - *The 3D Parallelism Blueprint.*
* **DistBelief: Large Scale Distributed Deep Networks** (Dean et al., 2012) - *Parameter Server origins.*
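
The pipeline-parallel papers above (GPipe, PipeDream, Megatron-LM) all wrestle with the same quantity: the pipeline "bubble", the share of device time lost while the pipeline fills and drains. A toy simulation makes the trade-off concrete; the helper names are ours, not from any of these codebases, and each stage is assumed to take one unit of time per micro-batch (forward pass only).

```python
# Toy GPipe-style schedule simulator (hypothetical names, forward pass only).

def gpipe_schedule(num_stages, num_microbatches):
    """List of time steps; each step holds the (stage, microbatch) pairs
    active at that step under the classic staggered fill/drain schedule."""
    total = num_stages + num_microbatches - 1
    return [[(s, t - s) for s in range(num_stages)
             if 0 <= t - s < num_microbatches]
            for t in range(total)]

def bubble_fraction(num_stages, num_microbatches):
    """Idle share of stage-time slots: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(len(gpipe_schedule(4, 8)))        # 11 steps for 4 stages, 8 micro-batches
print(round(bubble_fraction(4, 8), 3))  # 0.273
```

With 4 stages and 8 micro-batches the bubble is 3/11 ≈ 27%; at 64 micro-batches it drops to 3/67 ≈ 4.5%, which is why GPipe splits the mini-batch finely.
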
### Collective Communication & Algorithms

* **Bandwidth Optimal All-reduce Algorithms** (Patarasuk & Yuan, 2009) - *Ring AllReduce proof.*
* **Synthesizing Optimal Collective Algorithms (SCCL)** (Cai et al., 2020)
* **Rethinking ML Collective Communication as a Multi-Commodity Flow Problem** (Liu et al., 2024)
* **NCCL: Accelerated Multi-GPU Collective Communications** (NVIDIA, 2017-2025)
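
The algorithm Patarasuk & Yuan proved bandwidth-optimal is simple enough to simulate in plain Python: a reduce-scatter pass around the ring followed by an all-gather pass, so each node transfers only 2(n-1)/n of the vector regardless of node count. This is an illustrative sketch with hypothetical names; real systems run this pattern via NCCL/MPI primitives across devices.

```python
# Pure-Python simulation of ring all-reduce over n in-memory "nodes".

def ring_allreduce(vectors):
    """Sum-reduce identical-length vectors across n simulated nodes.
    Vector length must be divisible by n."""
    n = len(vectors)
    chunk = len(vectors[0]) // n
    assert n >= 2 and chunk * n == len(vectors[0])
    data = [list(v) for v in vectors]
    span = lambda c: range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 ring steps, node i holds the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c, dst = (i - step) % n, (i + 1) % n
            for j in span(c):
                data[dst][j] += data[i][j]

    # Phase 2: all-gather. Completed chunks circulate once around the ring.
    for step in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - step) % n, (i + 1) % n
            for j in span(c):
                data[dst][j] = data[i][j]
    return data
```
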
### Fault Tolerance & Resilience

* **Oobleck: Resilient Distributed Training using Pipeline Templates** (Jang et al., 2023)
* **Varuna: Scalable, Low-cost Training of Massive Models** (Athlur et al., 2022) - *Spot Instance resilience.*
* **Bamboo: Making Preemptible Instances Resilient for Affordable Training** (Thorpe et al., 2023)
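
Oobleck, Varuna, and Bamboo differ in mechanism (pipeline templates, reconfiguration, redundant computation), but all improve on the same baseline: checkpoint-and-restart under random preemption. A minimal sketch of that baseline, with hypothetical names, where "training" is just a step counter and "preemption" a coin flip:

```python
# Minimal checkpoint-and-restart sketch of the spot-instance failure model.
import random

def train_with_checkpoints(total_steps, ckpt_every, fail_prob, seed=0):
    """Returns (attempted_steps, checkpoints_written). attempted_steps counts
    redone work, so attempted_steps - total_steps is the recovery overhead."""
    rng = random.Random(seed)
    checkpoint = 0              # last durable step
    step = attempted = ckpts = 0
    while step < total_steps:
        if rng.random() < fail_prob:
            step = checkpoint   # preempted: lose work since last checkpoint
            continue
        step += 1
        attempted += 1
        if step % ckpt_every == 0:
            checkpoint = step   # durable snapshot
            ckpts += 1
    return attempted, ckpts

print(train_with_checkpoints(1000, 100, 0.0))  # (1000, 10): no failures, no redo
```

Shrinking `ckpt_every` bounds the lost work per preemption at the cost of more checkpoint writes; the papers above attack exactly this trade-off.
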
---

## Part II: Building the Machine Learning Fleet (Physical Layer)

### Compute Infrastructure (Silicon & Systems)

* **In-Datacenter Performance Analysis of a Tensor Processing Unit** (Jouppi et al., 2017) - *TPU v1.*
* **The Design Process for Google’s Training Chips: TPUv2 and TPUv3** (Norrie et al., 2021)
* **TPU v4: An Optically Reconfigurable Supercomputer** (Jouppi et al., 2023) - *SparseCores & Optical Switching.*
* **Dissecting the NVIDIA Hopper Architecture** (Luo et al., 2025) - *H100/H200 analysis.*
* **Microbenchmarking NVIDIA’s Blackwell Architecture** (Jarmusch et al., 2025) - *B200 analysis.*
* **Cerebras Wafer-Scale Integration: The Cerebras Story** (Lauterbach, 2021)
### Network Fabrics (Topologies & Protocols)

* **A Scalable, Commodity Data Center Network Architecture (Fat-Tree)** (Al-Fares et al., 2008)
* **Technology-Driven, Highly-Scalable Dragonfly Topology** (Kim et al., 2008)
* **Jellyfish: Networking Data Centers Randomly** (Singla et al., 2012)
* **Congestion Control for Large-Scale RDMA Deployments (DCQCN)** (Zhu et al., 2015)
* **HPCC: High Precision Congestion Control** (Li et al., 2019)
* **Swift: Delay is Simple and Effective for Congestion Control in the Datacenter** (Kumar et al., 2020) - *Google's Swift protocol.*
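
The fat-tree paper's sizing results are closed-form and worth internalizing: a 3-tier fat-tree built from k-port switches (k even) supports k³/4 hosts using 5k²/4 identical commodity switches. A quick calculator (the helper name is ours):

```python
# Back-of-envelope sizing for the k-ary fat-tree of Al-Fares et al. (2008).

def fat_tree_size(k):
    """For k-port switches (k even), return (hosts, switches) for a
    3-tier fat-tree: k^3/4 hosts on 5k^2/4 switches."""
    assert k % 2 == 0, "fat-tree requires an even switch radix"
    hosts = k ** 3 // 4
    edge = k * (k // 2)        # k pods, k/2 edge switches each
    agg = k * (k // 2)         # k pods, k/2 aggregation switches each
    core = (k // 2) ** 2
    return hosts, edge + agg + core

print(fat_tree_size(48))   # (27648, 2880): the paper's headline configuration
```
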
### Memory & Interconnect Standards

* **HBM3: Enabling Memory Resilience at Scale** (Standardization Papers)
* **Compute Express Link (CXL): A Comprehensive Survey** (Lian et al., 2024)
* **Next-Gen Interconnection Systems with CXL** (2024)

---
## Part III: Deployment & Optimization (The Serving Layer)

### Inference & Serving

* **Efficient Memory Management for LLM Serving with PagedAttention (vLLM)** (Kwon et al., 2023)
* **Orca: A Distributed Serving System for Transformer-Based Generative Models** (Yu et al., 2022) - *Continuous Batching.*
* **FlexFlow: A Distributed Deep Learning Framework** (Jia et al., 2019)
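
The core idea of PagedAttention is virtual-memory-style paging for the KV cache: memory is carved into fixed-size blocks and each sequence keeps a block table mapping logical token positions to physical blocks, so space is allocated on demand rather than reserved at maximum length. A minimal sketch of that bookkeeping; the class and method names are hypothetical, not the vLLM API.

```python
# Illustrative paged KV-cache block allocator (block tables only, no tensors).

class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one new token; returns its physical block."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:         # current block full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            table.append(self.free.pop())    # allocate on demand
        self.lengths[seq_id] = n + 1
        return table[-1]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence wastes at most one partially filled block, many more concurrent sequences fit in the same memory, which is what enables the continuous-batching throughput gains reported by vLLM and Orca.
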
### Performance Engineering (Quantization & Compression)

* **LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale** (Dettmers et al., 2022)
* **SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models** (Xiao et al., 2023)
* **FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness** (Dao et al., 2022)
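
The shared building block behind LLM.int8() and SmoothQuant is symmetric per-channel int8 quantization; the papers' contributions (outlier decomposition, activation-to-weight scale migration) sit on top of it. A plain-Python sketch with hypothetical names, treating each row as a channel:

```python
# Symmetric per-row int8 quantization: q = round(x / s), s = max|x| / 127.

def quantize_rows(matrix):
    """Quantize each row with its own scale. Returns (int_rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        m = max(abs(v) for v in row)
        s = m / 127 if m else 1.0    # avoid division by zero on all-zero rows
        q_rows.append([round(v / s) for v in row])
        scales.append(s)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Recover approximate floats; error per entry is at most s / 2."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Per-channel scales matter because one large-magnitude channel would otherwise crush the resolution of every other channel sharing its scale, which is exactly the outlier problem LLM.int8() documents.
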
---

## Part IV: The Vanguard (The Future of Scale)

### Optical & Photonic Systems

* **Leveraging Optical Chip-to-chip Connectivity** (Ayar Labs, 2023)
* **Photonic AI Acceleration: A New Kind of Computer** (Lightmatter, 2025)
* **Panel-Scale Reconfigurable Photonic Interconnects** (Hsueh et al., 2025)

---
## Core Textbooks (Strategic Guides)

* **Computer Architecture: A Quantitative Approach** (Hennessy & Patterson) - *The architectural Bible.*
* **Designing Machine Learning Systems** (Chip Huyen, 2022)
* **Designing Data-Intensive Applications** (Martin Kleppmann, 2017)
* **Distributed Systems: Principles and Paradigms** (Tanenbaum & van Steen)