# Volume 2: Machine Learning Systems at Scale - Seminal Bibliography
This document tracks the foundational research papers, hardware architectures, and industry standards that anchor Volume 2. It is organized for a textbook-scale deep dive, targeting 700-800 citations.
## Part I: Foundations of Scale (Distributed Logic)
### Distributed Training Paradigms
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (Huang et al., 2019)
- PipeDream: Generalized Pipeline Parallelism for DNN Training (Narayanan et al., 2019)
- Megatron-LM: Training Multi-Billion Parameter Models Using Model Parallelism (Shoeybi et al., 2019)
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., 2020) - Optimizer-state sharding; a memory-sizing sketch follows this list.
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Narayanan et al., 2021) - The 3D Parallelism Blueprint.
- DistBelief: Large Scale Distributed Deep Networks (Dean et al., 2012) - Parameter Server origins.
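
The ZeRO entry is, at heart, an exercise in arithmetic: mixed-precision Adam costs about 16 bytes of model state per parameter (2 fp16 parameter bytes, 2 fp16 gradient bytes, 12 fp32 optimizer bytes), and each ZeRO stage partitions one more of those terms across the data-parallel group. A minimal sizing sketch, assuming the paper's 2/2/12-byte breakdown; the function name and the 7.5B/64-GPU example are illustrative:

```python
def model_state_gb(params: float, devices: int, stage: int) -> float:
    """Per-device model-state memory (GB) for ZeRO stages 0-3."""
    p, g, opt = 2 * params, 2 * params, 12 * params   # bytes per term
    if stage >= 1:
        opt /= devices   # stage 1 shards optimizer states
    if stage >= 2:
        g /= devices     # stage 2 also shards gradients
    if stage >= 3:
        p /= devices     # stage 3 also shards parameters
    return (p + g + opt) / 1e9

# The paper's running example: a 7.5B-parameter model on 64 GPUs
# drops from 120 GB per GPU (stage 0) to under 2 GB (stage 3).
for s in range(4):
    print(f"ZeRO-{s}: {model_state_gb(7.5e9, 64, s):6.1f} GB per GPU")
```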
### Collective Communication & Algorithms
- Bandwidth Optimal All-reduce Algorithms (Patarasuk & Yuan, 2009) - Ring AllReduce proof; a simulated ring sketch follows this list.
- Synthesizing Optimal Collective Algorithms (SCCL) (Cai et al., 2020)
- Rethinking ML Collective Communication as a Multi-Commodity Flow Problem (Liu et al., 2024)
- NCCL: Accelerated Multi-GPU Collective Communications (NVIDIA, 2017-2025)
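
The Patarasuk & Yuan result is concrete enough to simulate: a ring all-reduce runs a reduce-scatter phase followed by an all-gather phase, 2(P-1) steps in total, with each rank moving only about 2N/P elements per step. A minimal single-process sketch, with ranks modeled as plain Python lists (all names are illustrative):

```python
import random

def ring_allreduce(bufs: list[list[float]]) -> None:
    """In-place elementwise-sum all-reduce across simulated ranks."""
    p, n = len(bufs), len(bufs[0])
    seg = lambda c: slice(c * n // p, (c + 1) * n // p)

    # Phase 1, reduce-scatter: after p-1 steps, rank r holds the
    # complete sum of segment (r + 1) % p.
    for step in range(p - 1):
        for r in range(p):
            s = seg((r - step) % p)        # segment rank r forwards now
            nxt = bufs[(r + 1) % p]
            for i, v in enumerate(bufs[r][s], s.start):
                nxt[i] += v

    # Phase 2, all-gather: circulate the finished segments around the ring.
    for step in range(p - 1):
        for r in range(p):
            s = seg((r + 1 - step) % p)    # finished segment to forward
            bufs[(r + 1) % p][s] = bufs[r][s]

# 4 ranks, 8 elements each: result must equal the elementwise sum.
p, n = 4, 8
data = [[random.random() for _ in range(n)] for _ in range(p)]
expected = [sum(col) for col in zip(*data)]
ring_allreduce(data)
assert all(abs(a - b) < 1e-9 for buf in data for a, b in zip(buf, expected))
```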
### Fault Tolerance & Resilience
- Oobleck: Resilient Distributed Training using Pipeline Templates (Jang et al., 2023)
- Varuna: Scalable, Low-cost Training of Massive Models (Athlur et al., 2022) - Spot Instance resilience; a checkpoint/resume sketch follows this list.
- Bamboo: Making Preemptible Instances Resilient for Affordable Training (Thorpe et al., 2023)
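
All three resilience papers above revolve around the same contract: when an instance can vanish at any moment, a preemption should cost at most one checkpoint interval of work. A minimal sketch of that contract, not any one paper's mechanism (the file name, interval, and state layout are illustrative):

```python
import os
import pickle
import tempfile

CKPT = "train_state.pkl"   # illustrative path

def save_atomic(state: dict, path: str = CKPT) -> None:
    """Write to a temp file, then rename: a preemption mid-write
    can never corrupt the previous checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)              # atomic on POSIX filesystems

def load_or_init(path: str = CKPT) -> dict:
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "weights": None}

state = load_or_init()                 # resumes transparently after preemption
for step in range(state["step"], 10_000):
    # ... one training step updating state["weights"] goes here ...
    state["step"] = step + 1
    if state["step"] % 100 == 0:       # interval trades checkpoint cost vs. rework
        save_atomic(state)
```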
## Part II: Building the Machine Learning Fleet (Physical Layer)
### Compute Infrastructure (Silicon & Systems)
- In-Datacenter Performance Analysis of a Tensor Processing Unit (Jouppi et al., 2017) - TPU v1; its roofline framing is sketched after this list.
- The Design Process for Google’s Training Chips: TPUv2 and TPUv3 (Norrie et al., 2021)
- TPU v4: An Optically Reconfigurable Supercomputer (Jouppi et al., 2023) - SparseCores & Optical Switching.
- Dissecting the NVIDIA Hopper Architecture (Luo et al., 2025) - H100/H200 analysis.
- Microbenchmarking NVIDIA’s Blackwell Architecture (Jarmusch et al., 2025) - B200 analysis.
- The Path to Successful Wafer-Scale Integration: The Cerebras Story (Lauterbach, 2021)
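
The TPU v1 paper anchors its evaluation in the roofline model: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. A sketch using the paper's headline figures (92 TOPS at 8-bit, 34 GB/s of memory bandwidth); the function itself is generic:

```python
def roofline_ops(peak: float, mem_bw: float, intensity: float) -> float:
    """Attainable ops/s at a given arithmetic intensity (ops per byte)."""
    return min(peak, mem_bw * intensity)

PEAK = 92e12   # TPU v1 peak: 92 TOPS (8-bit)
BW = 34e9      # TPU v1 DDR3 bandwidth: 34 GB/s
print(f"ridge point: {PEAK / BW:.0f} ops/byte")   # bandwidth-bound below this
for ai in (10, 100, 1000, 5000):
    print(f"AI={ai:>4}: {roofline_ops(PEAK, BW, ai) / 1e12:5.1f} TOPS")
```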
### Network Fabrics (Topologies & Protocols)
- A Scalable, Commodity Data Center Network Architecture (Fat-Tree) (Al-Fares et al., 2008) - a sizing sketch follows this list.
- Technology-Driven, Highly-Scalable Dragonfly Topology (Kim et al., 2008)
- Jellyfish: Networking Data Centers Randomly (Singla et al., 2012)
- Congestion Control for Large-Scale RDMA Deployments (DCQCN) (Zhu et al., 2015)
- HPCC: High Precision Congestion Control (Li et al., 2019)
- Swift: Delay is Simple and Effective for Congestion Control in the Datacenter (Kumar et al., 2020) - Google's delay-based transport.
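
The fat-tree paper's scaling argument is a closed-form count: k-port switches yield k pods, (k/2)^2 core switches, and k^3/4 hosts at full bisection bandwidth. A sketch of that sizing (the function name is illustrative; the 48-port example is the paper's):

```python
def fat_tree(k: int) -> dict:
    """Component counts for a k-ary fat-tree built from k-port switches."""
    assert k % 2 == 0, "switch radix must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),   # k/2 per pod
        "agg_switches": k * (k // 2),    # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,            # full bisection bandwidth
    }

print(fat_tree(48))   # the paper's example: 48-port switches -> 27,648 hosts
```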
### Memory & Interconnect Standards
- HBM3: Enabling Memory Resilience at Scale (Standardization Papers) - a bandwidth back-of-envelope follows this list.
- Compute Express Link (CXL): A Comprehensive Survey (Lian et al., 2024)
- Next-Gen Interconnection Systems with CXL (2024)
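
For HBM, per-stack bandwidth is simply interface width times per-pin data rate; a back-of-envelope assuming the JEDEC HBM3 figures of a 1024-bit interface at 6.4 Gb/s per pin:

```python
width_bits = 1024      # HBM3 interface width per stack
pin_gbps = 6.4         # HBM3 per-pin data rate, Gb/s
print(f"{width_bits * pin_gbps / 8:.0f} GB/s per stack")   # ~819 GB/s
```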
## Part III: Deployment & Optimization (The Serving Layer)
### Inference & Serving
- Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) (Kwon et al., 2023) - see the block-table sketch after this list.
- Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., 2022) - Continuous Batching.
- FlexFlow: Beyond Data and Model Parallelism for Deep Neural Networks (Jia et al., 2019)
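
PagedAttention's core move is virtual-memory-style indirection for the KV cache: fixed-size blocks plus a per-sequence block table, so memory is claimed one block at a time instead of as a large contiguous reservation per sequence. A toy allocator in that spirit (block size, class, and method names are illustrative, not vLLM's API):

```python
BLOCK = 16  # tokens per KV block (illustrative)

class BlockTable:
    def __init__(self, free_blocks: list[int]):
        self.free = free_blocks      # pool of physical block ids
        self.table: list[int] = []   # logical block -> physical block

    def append_token(self, pos: int) -> tuple[int, int]:
        """Map token position `pos` to (physical_block, offset),
        grabbing a new block only when the last one fills up."""
        if pos // BLOCK >= len(self.table):
            self.table.append(self.free.pop())
        return self.table[pos // BLOCK], pos % BLOCK

pool = list(range(1024))
seq = BlockTable(pool)
for t in range(40):    # a 40-token sequence needs ceil(40/16) = 3 blocks
    blk, off = seq.append_token(t)
print(seq.table)       # three physical blocks, allocated lazily
```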
### Performance Engineering (Quantization & Compression)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) - see the absmax sketch after this list.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Xiao et al., 2023)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)
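
LLM.int8() builds on row-wise absmax quantization: scale each row so its largest magnitude maps to 127, compute in int8, and rescale afterwards; the paper's contribution is splitting out outlier columns into fp16. A NumPy sketch of just the absmax step (function names are illustrative):

```python
import numpy as np

def quantize_absmax(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-row symmetric int8 quantization; returns (int8 tensor, scales)."""
    scale = 127.0 / np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8)
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) / scale

x = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_absmax(x)
err = np.abs(dequantize(q, s) - x).max()
print(f"max abs error: {err:.4f}")   # small unless a row has outliers, which
                                     # is what LLM.int8()'s decomposition handles
```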
## Part IV: The Vanguard (The Future of Scale)
### Optical & Photonic Systems
- Leveraging Optical Chip-to-chip Connectivity (Ayar Labs, 2023)
- Photonic AI Acceleration: A New Kind of Computer (Lightmatter, 2025)
- Panel-Scale Reconfigurable Photonic Interconnects (Hsueh et al., 2025)
## Core Textbooks (Strategic Guides)
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson) - The architectural Bible.
- Designing Machine Learning Systems (Chip Huyen, 2022)
- Designing Data-Intensive Applications (Martin Kleppmann, 2017)
- Distributed Systems: Principles and Paradigms (Tanenbaum & van Steen)