Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-30 09:38:38 -05:00)
# Volume 2: Machine Learning Systems at Scale - Seminal Bibliography

This document tracks the foundational research papers, hardware architectures, and industry standards that anchor Volume 2, organized for a "textbook-scale" deep dive (targeting 700-800 citations).

---

## Part I: Foundations of Scale (Distributed Logic)
### Distributed Training Paradigms

* **GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism** (Huang et al., 2019)
* **PipeDream: Generalized Pipeline Parallelism for DNN Training** (Narayanan et al., 2019)
* **Megatron-LM: Training Multi-Billion Parameter Models Using Model Parallelism** (Shoeybi et al., 2019)
* **ZeRO: Memory Optimizations Toward Training Trillion Parameter Models** (Rajbhandari et al., 2020)
* **Efficient Large-Scale Language Model Training on GPU Clusters** (Narayanan et al., 2021) - *The 3D Parallelism Blueprint.*
* **DistBelief: Large Scale Distributed Deep Networks** (Dean et al., 2012) - *Parameter Server origins.*
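
The pipeline-parallel papers above (GPipe, PipeDream, Megatron-LM) all wrestle with the same quantity: the pipeline "bubble", the share of device time lost while the pipeline fills and drains. A toy simulation makes the trade-off concrete; the helper names are ours, not from any of these codebases, and each stage is assumed to take one unit of time per micro-batch (forward pass only).

```python
# Toy GPipe-style schedule simulator (hypothetical names, forward pass only).

def gpipe_schedule(num_stages, num_microbatches):
    """List of time steps; each step holds the (stage, microbatch) pairs
    active at that step under the classic staggered fill/drain schedule."""
    total = num_stages + num_microbatches - 1
    return [[(s, t - s) for s in range(num_stages)
             if 0 <= t - s < num_microbatches]
            for t in range(total)]

def bubble_fraction(num_stages, num_microbatches):
    """Idle share of stage-time slots: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(len(gpipe_schedule(4, 8)))        # 11 steps for 4 stages, 8 micro-batches
print(round(bubble_fraction(4, 8), 3))  # 0.273
```

With 4 stages and 8 micro-batches the bubble is 3/11 ≈ 27%; at 64 micro-batches it drops to 3/67 ≈ 4.5%, which is why GPipe splits the mini-batch finely.
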
### Collective Communication & Algorithms

* **Bandwidth Optimal All-reduce Algorithms** (Patarasuk & Yuan, 2009) - *Ring AllReduce proof.*
* **Synthesizing Optimal Collective Algorithms (SCCL)** (Cai et al., 2020)
* **Rethinking ML Collective Communication as a Multi-Commodity Flow Problem** (Liu et al., 2024)
* **NCCL: Accelerated Multi-GPU Collective Communications** (NVIDIA, 2017-2025)
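
The algorithm Patarasuk & Yuan proved bandwidth-optimal is simple enough to simulate in plain Python: a reduce-scatter pass around the ring followed by an all-gather pass, so each node transfers only 2(n-1)/n of the vector regardless of node count. This is an illustrative sketch with hypothetical names; real systems run this pattern via NCCL/MPI primitives across devices.

```python
# Pure-Python simulation of ring all-reduce over n in-memory "nodes".

def ring_allreduce(vectors):
    """Sum-reduce identical-length vectors across n simulated nodes.
    Vector length must be divisible by n."""
    n = len(vectors)
    chunk = len(vectors[0]) // n
    assert n >= 2 and chunk * n == len(vectors[0])
    data = [list(v) for v in vectors]
    span = lambda c: range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 ring steps, node i holds the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c, dst = (i - step) % n, (i + 1) % n
            for j in span(c):
                data[dst][j] += data[i][j]

    # Phase 2: all-gather. Completed chunks circulate once around the ring.
    for step in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - step) % n, (i + 1) % n
            for j in span(c):
                data[dst][j] = data[i][j]
    return data
```
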
### Fault Tolerance & Resilience

* **Oobleck: Resilient Distributed Training using Pipeline Templates** (Jang et al., 2023)
* **Varuna: Scalable, Low-cost Training of Massive Models** (Athlur et al., 2022) - *Spot Instance resilience.*
* **Bamboo: Making Preemptible Instances Resilient for Affordable Training** (Thorpe et al., 2023)
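
Oobleck, Varuna, and Bamboo differ in mechanism (pipeline templates, reconfiguration, redundant computation), but all improve on the same baseline: checkpoint-and-restart under random preemption. A minimal sketch of that baseline, with hypothetical names, where "training" is just a step counter and "preemption" a coin flip:

```python
# Minimal checkpoint-and-restart sketch of the spot-instance failure model.
import random

def train_with_checkpoints(total_steps, ckpt_every, fail_prob, seed=0):
    """Returns (attempted_steps, checkpoints_written). attempted_steps counts
    redone work, so attempted_steps - total_steps is the recovery overhead."""
    rng = random.Random(seed)
    checkpoint = 0              # last durable step
    step = attempted = ckpts = 0
    while step < total_steps:
        if rng.random() < fail_prob:
            step = checkpoint   # preempted: lose work since last checkpoint
            continue
        step += 1
        attempted += 1
        if step % ckpt_every == 0:
            checkpoint = step   # durable snapshot
            ckpts += 1
    return attempted, ckpts

print(train_with_checkpoints(1000, 100, 0.0))  # (1000, 10): no failures, no redo
```

Shrinking `ckpt_every` bounds the lost work per preemption at the cost of more checkpoint writes; the papers above attack exactly this trade-off.
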
---

## Part II: Building the Machine Learning Fleet (Physical Layer)

### Compute Infrastructure (Silicon & Systems)

* **In-Datacenter Performance Analysis of a Tensor Processing Unit** (Jouppi et al., 2017) - *TPU v1.*
* **The Design Process for Google’s Training Chips: TPUv2 and TPUv3** (Norrie et al., 2021)
* **TPU v4: An Optically Reconfigurable Supercomputer** (Jouppi et al., 2023) - *SparseCores & Optical Switching.*
* **Dissecting the NVIDIA Hopper Architecture** (Luo et al., 2025) - *H100/H200 analysis.*
* **Microbenchmarking NVIDIA’s Blackwell Architecture** (Jarmusch et al., 2025) - *B200 analysis.*
* **Cerebras Wafer-Scale Integration: The Cerebras Story** (Lauterbach, 2021)
### Network Fabrics (Topologies & Protocols)

* **A Scalable, Commodity Data Center Network Architecture (Fat-Tree)** (Al-Fares et al., 2008)
* **Technology-Driven, Highly-Scalable Dragonfly Topology** (Kim et al., 2008)
* **Jellyfish: Networking Data Centers Randomly** (Singla et al., 2012)
* **Congestion Control for Large-Scale RDMA Deployments (DCQCN)** (Zhu et al., 2015)
* **HPCC: High Precision Congestion Control** (Li et al., 2019)
* **Swift: Delay is Simple and Effective for Congestion Control in the Datacenter** (Kumar et al., 2020) - *Google's Swift protocol.*
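
The fat-tree paper's sizing results are closed-form and worth internalizing: a 3-tier fat-tree built from k-port switches (k even) supports k³/4 hosts using 5k²/4 identical commodity switches. A quick calculator (the helper name is ours):

```python
# Back-of-envelope sizing for the k-ary fat-tree of Al-Fares et al. (2008).

def fat_tree_size(k):
    """For k-port switches (k even), return (hosts, switches) for a
    3-tier fat-tree: k^3/4 hosts on 5k^2/4 switches."""
    assert k % 2 == 0, "fat-tree requires an even switch radix"
    hosts = k ** 3 // 4
    edge = k * (k // 2)        # k pods, k/2 edge switches each
    agg = k * (k // 2)         # k pods, k/2 aggregation switches each
    core = (k // 2) ** 2
    return hosts, edge + agg + core

print(fat_tree_size(48))   # (27648, 2880): the paper's headline configuration
```
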
### Memory & Interconnect Standards

* **HBM3: Enabling Memory Resilience at Scale** (Standardization Papers)
* **Compute Express Link (CXL): A Comprehensive Survey** (Lian et al., 2024)
* **Next-Gen Interconnection Systems with CXL** (2024)

---
## Part III: Deployment & Optimization (The Serving Layer)

### Inference & Serving

* **Efficient Memory Management for LLM Serving with PagedAttention (vLLM)** (Kwon et al., 2023)
* **Orca: A Distributed Serving System for Transformer-Based Generative Models** (Yu et al., 2022) - *Continuous Batching.*
* **FlexFlow: A Distributed Deep Learning Framework** (Jia et al., 2019)
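
The core idea of PagedAttention is virtual-memory-style paging for the KV cache: memory is carved into fixed-size blocks and each sequence keeps a block table mapping logical token positions to physical blocks, so space is allocated on demand rather than reserved at maximum length. A minimal sketch of that bookkeeping; the class and method names are hypothetical, not the vLLM API.

```python
# Illustrative paged KV-cache block allocator (block tables only, no tensors).

class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one new token; returns its physical block."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:         # current block full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            table.append(self.free.pop())    # allocate on demand
        self.lengths[seq_id] = n + 1
        return table[-1]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence wastes at most one partially filled block, many more concurrent sequences fit in the same memory, which is what enables the continuous-batching throughput gains reported by vLLM and Orca.
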
### Performance Engineering (Quantization & Compression)

* **LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale** (Dettmers et al., 2022)
* **SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models** (Xiao et al., 2023)
* **FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness** (Dao et al., 2022)
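
The shared building block behind LLM.int8() and SmoothQuant is symmetric per-channel int8 quantization; the papers' contributions (outlier decomposition, activation-to-weight scale migration) sit on top of it. A plain-Python sketch with hypothetical names, treating each row as a channel:

```python
# Symmetric per-row int8 quantization: q = round(x / s), s = max|x| / 127.

def quantize_rows(matrix):
    """Quantize each row with its own scale. Returns (int_rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        m = max(abs(v) for v in row)
        s = m / 127 if m else 1.0    # avoid division by zero on all-zero rows
        q_rows.append([round(v / s) for v in row])
        scales.append(s)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Recover approximate floats; error per entry is at most s / 2."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Per-channel scales matter because one large-magnitude channel would otherwise crush the resolution of every other channel sharing its scale, which is exactly the outlier problem LLM.int8() documents.
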
---

## Part IV: The Vanguard (The Future of Scale)

### Optical & Photonic Systems

* **Leveraging Optical Chip-to-chip Connectivity** (Ayar Labs, 2023)
* **Photonic AI Acceleration: A New Kind of Computer** (Lightmatter, 2025)
* **Panel-Scale Reconfigurable Photonic Interconnects** (Hsueh et al., 2025)

---
## Core Textbooks (Strategic Guides)

* **Computer Architecture: A Quantitative Approach** (Hennessy & Patterson) - *The architectural Bible.*
* **Designing Machine Learning Systems** (Chip Huyen, 2022)
* **Designing Data-Intensive Applications** (Martin Kleppmann, 2017)
* **Distributed Systems: Principles and Paradigms** (Tanenbaum & van Steen)