# Volume 2: Machine Learning Systems at Scale - Seminal Bibliography
This document tracks the foundational research papers, hardware architectures, and industry standards that anchor Volume 2. It is organised for a "textbook-scale" deep dive, targeting 700-800 citations.
---
## Part I: Foundations of Scale (Distributed Logic)
### Distributed Training Paradigms
* **GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism** (Huang et al., 2019)
* **PipeDream: Generalized Pipeline Parallelism for DNN Training** (Narayanan et al., 2019)
* **Megatron-LM: Training Multi-Billion Parameter Models Using Model Parallelism** (Shoeybi et al., 2019)
* **ZeRO: Memory Optimizations Toward Training Trillion Parameter Models** (Rajbhandari et al., 2020) - *Optimizer-state, gradient, and parameter sharding; see the memory sketch after this list.*
* **Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM** (Narayanan et al., 2021) - *The 3D Parallelism Blueprint.*
* **DistBelief: Large Scale Distributed Deep Networks** (Dean et al., 2012) - *Parameter Server origins.*
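
ZeRO's headline result is largely arithmetic: with mixed-precision Adam, each parameter carries roughly 16 bytes of parameter, gradient, and optimizer state, and the three ZeRO stages shard progressively more of that across the data-parallel group. A minimal back-of-the-envelope sketch of the per-GPU memory; the model size and world size below are illustrative, not from the paper:

```python
# Back-of-the-envelope per-GPU memory for ZeRO stages 1-3 (illustrative only).
# Assumes mixed-precision Adam: fp16 params (2 B) + fp16 grads (2 B)
# + fp32 master params, momentum, and variance (12 B) = 16 bytes/parameter,
# as in Rajbhandari et al. (2020). Activations and buffers are ignored.

def zero_memory_gib(params_billion: float, world_size: int) -> dict:
    p = params_billion * 1e9
    param_b, grad_b, optim_b = 2 * p, 2 * p, 12 * p   # bytes for each state
    gib = 1024 ** 3
    return {
        "baseline (no sharding)":        (param_b + grad_b + optim_b) / gib,
        "ZeRO-1 (optimizer sharded)":    (param_b + grad_b + optim_b / world_size) / gib,
        "ZeRO-2 (+ gradients sharded)":  (param_b + (grad_b + optim_b) / world_size) / gib,
        "ZeRO-3 (+ parameters sharded)": ((param_b + grad_b + optim_b) / world_size) / gib,
    }

if __name__ == "__main__":
    # Hypothetical 7B-parameter model on 64 data-parallel GPUs.
    for stage, gib in zero_memory_gib(params_billion=7, world_size=64).items():
        print(f"{stage:32s} {gib:8.1f} GiB/GPU")
```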
### Collective Communication & Algorithms
* **Bandwidth Optimal All-reduce Algorithms** (Patarasuk & Yuan, 2009) - *Ring AllReduce proof; see the sketch after this list.*
* **Synthesizing Optimal Collective Algorithms (SCCL)** (Cai et al., 2020)
* **Rethinking ML Collective Communication as a Multi-Commodity Flow Problem** (Liu et al., 2024)
* **NCCL: Accelerated Multi-GPU Collective Communications** (NVIDIA, 2017-2025)
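
For the ring all-reduce cited above, each of the N ranks splits its vector into N chunks, reduce-scatters them around the ring, then all-gathers the results, so every rank transmits roughly 2(N-1)/N of its data regardless of N. A minimal single-process simulation of that schedule, under the assumption of synchronous steps; function and variable names are illustrative, not from any library:

```python
import numpy as np

def ring_allreduce(per_rank_chunks):
    """Simulate ring all-reduce. `per_rank_chunks[r][c]` is rank r's c-th
    chunk; every rank holds n chunks, where n is the number of ranks.
    Returns the state after reduce-scatter + all-gather: every rank ends
    up with the elementwise sum of all inputs."""
    n = len(per_rank_chunks)
    data = [[np.array(c, dtype=float) for c in rank] for rank in per_rank_chunks]

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod n to
    # rank (r + 1) mod n, which accumulates it. After n-1 steps, rank r owns
    # the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            data[(r + 1) % n][c] = data[(r + 1) % n][c] + data[r][c]

    # Phase 2: all-gather. At step s, rank r forwards the reduced chunk
    # (r + 1 - s) mod n to rank (r + 1) mod n, which overwrites its copy.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            data[(r + 1) % n][c] = data[r][c].copy()
    return data

if __name__ == "__main__":
    ranks, chunk = 4, 3
    inputs = [np.arange(ranks * chunk).reshape(ranks, chunk) + 10 * r
              for r in range(ranks)]
    out = ring_allreduce(inputs)
    expected = sum(inputs).reshape(-1)
    assert all(np.array_equal(np.concatenate(out[r]), expected)
               for r in range(ranks))
```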
### Fault Tolerance & Resilience
* **Oobleck: Resilient Distributed Training using Pipeline Templates** (Jang et al., 2023)
* **Varuna: Scalable, Low-cost Training of Massive Models** (Athlur et al., 2022) - *Spot Instance resilience.*
* **Bamboo: Making Preemptible Instances Resilient for Affordable Training** (Thorpe et al., 2023)
---
## Part II: Building the Machine Learning Fleet (Physical Layer)
### Compute Infrastructure (Silicon & Systems)
* **In-Datacenter Performance Analysis of a Tensor Processing Unit** (Jouppi et al., 2017) - *TPU v1.*
* **The Design Process for Google's Training Chips: TPUv2 and TPUv3** (Norrie et al., 2021)
* **TPU v4: An Optically Reconfigurable Supercomputer** (Jouppi et al., 2023) - *SparseCores & Optical Switching.*
* **Dissecting the NVIDIA Hopper Architecture** (Luo et al., 2025) - *H100/H200 analysis.*
* **Microbenchmarking NVIDIA's Blackwell Architecture** (Jarmusch et al., 2025) - *B200 analysis.*
* **The Path to Successful Wafer-Scale Integration: The Cerebras Story** (Lauterbach, 2021)
### Network Fabrics (Topologies & Protocols)
* **A Scalable, Commodity Data Center Network Architecture (Fat-Tree)** (Al-Fares et al., 2008) - *k-ary fat-tree; see the scaling sketch after this list.*
* **Technology-Driven, Highly-Scalable Dragonfly Topology** (Kim et al., 2008)
* **Jellyfish: Networking Data Centers Randomly** (Singla et al., 2012)
* **Congestion Control for Large-Scale RDMA Deployments (DCQCN)** (Zhu et al., 2015)
* **HPCC: High Precision Congestion Control** (Li et al., 2019)
* **Swift: Delay is Simple and Effective for Congestion Control** (Kumar et al., 2020) - *Google's Swift protocol.*
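
For the fat-tree entry above, the scaling argument is a counting exercise: a k-ary fat-tree built entirely from k-port commodity switches supports k^3/4 hosts at full bisection bandwidth using 5k^2/4 switches. A short sketch of that arithmetic; the k values are illustrative, and k = 48 reproduces the paper's 27,648-host example:

```python
# Counting exercise for a k-ary fat-tree (Al-Fares et al., 2008):
# k pods, each with k/2 edge and k/2 aggregation switches; (k/2)^2 core
# switches; every switch has k ports; each pod hosts (k/2)^2 servers.

def fat_tree(k: int) -> dict:
    assert k % 2 == 0, "fat-tree radix k must be even"
    hosts = (k ** 3) // 4
    edge = agg = k * (k // 2)          # k pods x k/2 switches each
    core = (k // 2) ** 2
    return {"hosts": hosts, "edge": edge, "aggregation": agg,
            "core": core, "total_switches": edge + agg + core}

for k in (4, 16, 48):
    print(k, fat_tree(k))
```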
### Memory & Interconnect Standards
* **HBM3: Enabling Memory Resilience at Scale** (Standardization Papers)
* **Compute Express Link (CXL): A Comprehensive Survey** (Lian et al., 2024)
* **Next-Gen Interconnection Systems with CXL** (2024)
---
## Part III: Deployment & Optimization (The Serving Layer)
### Inference & Serving
* **Efficient Memory Management for LLM Serving with PagedAttention (vLLM)** (Kwon et al., 2023) - *Block-table KV-cache management; see the sketch after this list.*
* **Orca: A Distributed Serving System for Transformer-Based Generative Models** (Yu et al., 2022) - *Continuous Batching.*
* **FlexFlow: A Distributed Deep Learning Framework** (Jia et al., 2019)
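
PagedAttention, flagged above, stores the KV cache in fixed-size blocks and addresses them through a per-sequence block table, so memory is allocated one block at a time rather than reserved up front for a sequence's maximum length. A toy sketch of that bookkeeping; the class, method, and parameter names are hypothetical, not vLLM's API, and the actual K/V tensors are omitted:

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention (Kwon et al., 2023).
    Tracks which physical blocks each sequence's logical pages map to."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}   # seq_id -> physical block ids
        self.seq_lens: dict[str, int] = {}             # seq_id -> tokens written

    def append_token(self, seq_id: str) -> int:
        """Reserve cache space for one more token; allocate a new block only
        when the sequence's last block is full. Returns the block used."""
        table = self.block_tables.setdefault(seq_id, [])
        used = self.seq_lens.get(seq_id, 0)
        if used % self.block_size == 0:                # last block full, or no block yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; evict or preempt a sequence")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = used + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):                     # 6 tokens -> 2 physical blocks for "req-0"
    cache.append_token("req-0")
print(cache.block_tables["req-0"], len(cache.free_blocks))
cache.free("req-0")
```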
### Performance Engineering (Quantization & Compression)
* **LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale** (Dettmers et al., 2022) - *Vector-wise absmax quantization with outlier decomposition; see the sketch after this list.*
* **SmoothQuant: Accurate and Efficient Post-Training Quantization** (Xiao et al., 2023)
* **FlashAttention: Fast and Memory-Efficient Exact Attention** (Dao et al., 2022)
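
The quantization entries above share one primitive: symmetric absmax scaling, where each row (or channel) of a tensor is mapped to int8 using its own maximum magnitude and the scale is undone after the matmul. A minimal numpy sketch of per-row absmax quantization and its reconstruction error; it deliberately omits LLM.int8()'s outlier decomposition and SmoothQuant's activation-to-weight scale migration:

```python
import numpy as np

def quantize_rowwise_int8(x: np.ndarray):
    """Symmetric per-row absmax quantization: each row gets its own scale,
    so one large-magnitude row does not crush the resolution of the others."""
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)        # avoid divide-by-zero
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64)).astype(np.float32)
x[0] *= 50                                             # an "outlier" row
q, s = quantize_rowwise_int8(x)
err = np.abs(dequantize(q, s) - x).max()
print(f"max abs reconstruction error: {err:.4f}")      # small relative to each row's scale
```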
---
## Part IV: The Vanguard (The Future of Scale)
### Optical & Photonic Systems
* **Leveraging Optical Chip-to-chip Connectivity** (Ayar Labs, 2023)
* **Photonic AI Acceleration: A New Kind of Computer** (Lightmatter, 2025)
* **Panel-Scale Reconfigurable Photonic Interconnects** (Hsueh et al., 2025)
---
## Core Textbooks (Strategic Guides)
* **Computer Architecture: A Quantitative Approach** (Hennessy & Patterson) - *The architectural Bible.*
* **Designing Machine Learning Systems** (Chip Huyen, 2022)
* **Designing Data-Intensive Applications** (Martin Kleppmann, 2017)
* **Distributed Systems: Principles and Paradigms** (Tanenbaum & van Steen)