# Volume 1 Seminal Papers Corpus

This document defines the **core corpus of papers** that should be cited in each chapter, with justification for why each is seminal.

Generated: January 29, 2026

---

## How to Use This Document

For each paper listed under a chapter:
1. Check whether the chapter already cites it
2. If it is not cited but the chapter discusses its topic → ADD the citation
3. If the chapter does not discuss its topic → SKIP (don't force citations)
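
The three steps above amount to a small decision rule applied to each (chapter, paper) pair. A minimal sketch, assuming the two inputs have already been determined by hand or by a separate script; the function name `citation_action` and the `"KEEP"` label are illustrative and not part of any existing tooling:

```python
def citation_action(already_cited: bool, topic_discussed: bool) -> str:
    """Return the action for one corpus paper in one chapter.

    "ADD" and "SKIP" mirror the steps above; "KEEP" is a hypothetical
    label for the case where the citation is already present.
    """
    if already_cited:
        return "KEEP"  # step 1: already cited, nothing to do
    if topic_discussed:
        return "ADD"   # step 2: topic is discussed but the paper is not cited
    return "SKIP"      # step 3: topic is absent, so don't force the citation


# Example: a paper that is not yet cited in a chapter that covers its topic.
print(citation_action(already_cited=False, topic_discussed=True))
```

Deciding whether a topic is "discussed" remains a manual judgment; the sketch only records the outcome.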

---

## Chapter 1: Introduction

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Computing Machinery and Intelligence | Turing | 1950 | Introduced Turing Test, framed machine intelligence |
| A Proposal for the Dartmouth Summer Research Project | McCarthy et al. | 1955 | Coined "artificial intelligence", launched AI as field |
| The Perceptron | Rosenblatt | 1957 | First learning algorithm that adjusts weights from data |
| Perceptrons: An Introduction to Computational Geometry | Minsky & Papert | 1969 | Proved limits of single-layer perceptrons, contributed to first AI winter |
| Learning Representations by Back-Propagating Errors | Rumelhart, Hinton, Williams | 1986 | Popularized backpropagation, enabled deep learning |
| ImageNet Classification with Deep CNNs (AlexNet) | Krizhevsky et al. | 2012 | Sparked deep learning revolution |
| Software 2.0 | Karpathy | 2017 | Framed shift from code to learned models |
| The Bitter Lesson | Sutton | 2019 | Argued that general methods leveraging computation beat encoded expertise |
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | Established ML systems engineering as discipline |
| AI and Compute | Amodei & Hernandez | 2018 | Quantified exponential growth in AI compute |

---

## Chapter 2: ML Systems

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| In-Datacenter Performance Analysis of a TPU | Jouppi et al. | 2017 | First TPU disclosure, established domain-specific accelerators |
| Hitting the Memory Wall | Wulf & McKee | 1995 | Coined "memory wall", identified fundamental bottleneck |
| MobileNets | Howard et al. | 2017 | Enabled efficient mobile deployment |
| Communication-Efficient Learning (FedAvg) | McMahan et al. | 2017 | Established federated learning |
| Widening Access to Applied ML with TinyML | Reddi et al. | 2022 | Democratized ML on resource-constrained devices |
| MLPerf Tiny Benchmark | Banbury, Reddi et al. | 2021 | First benchmark for microcontroller ML |
| Deep Learning Recommendation Model (DLRM) | Naumov et al. | 2019 | Industry-standard recommendation architecture |
| Roofline Model | Williams et al. | 2009 | Framework for compute vs memory-bound analysis |

---

## Chapter 3: Neural Computation

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Learning Representations by Back-Propagating Errors | Rumelhart et al. | 1986 | Standard training algorithm |
| Rectified Linear Units Improve RBMs | Nair & Hinton | 2010 | Established ReLU as default activation |
| Adam: A Method for Stochastic Optimization | Kingma & Ba | 2014 | Default optimizer for most applications |
| Dropout: Preventing Overfitting | Srivastava et al. | 2014 | Standard regularization technique |
| Batch Normalization | Ioffe & Szegedy | 2015 | Enables faster, stable training |
| Understanding Difficulty of Training Deep Networks | Glorot & Bengio | 2010 | Xavier/Glorot initialization |
| Deep Learning (Nature) | LeCun, Bengio, Hinton | 2015 | Landmark review marking mainstream acceptance |
| Approximation by Superpositions of a Sigmoidal Function | Cybenko | 1989 | Universal approximation theorem |
| Delving Deep into Rectifiers (He Init) | He et al. | 2015 | Initialization for ReLU networks |

---

## Chapter 4: Network Architectures

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Gradient-based Learning (LeNet) | LeCun et al. | 1998 | First successful CNN |
| ImageNet Classification (AlexNet) | Krizhevsky et al. | 2012 | Deep learning breakthrough |
| Very Deep CNNs (VGGNet) | Simonyan & Zisserman | 2014 | Showed depth improves performance |
| Going Deeper with Convolutions (GoogLeNet) | Szegedy et al. | 2015 | Multi-scale inception modules |
| Deep Residual Learning (ResNet) | He et al. | 2016 | Skip connections enabled 100+ layers |
| Densely Connected CNNs (DenseNet) | Huang et al. | 2017 | Feature reuse through dense connectivity |
| Long Short-Term Memory | Hochreiter & Schmidhuber | 1997 | Gating for long-term dependencies |
| GRU | Cho et al. | 2014 | Simpler alternative to LSTM |
| Neural Machine Translation (Attention) | Bahdanau et al. | 2014 | Introduced attention mechanism |
| Attention Is All You Need | Vaswani et al. | 2017 | Transformer architecture |
| BERT | Devlin et al. | 2019 | Bidirectional pre-training paradigm |
| GPT | Radford et al. | 2018 | Autoregressive pre-training |
| Vision Transformer (ViT) | Dosovitskiy et al. | 2021 | Transformers for vision |
| Layer Normalization | Ba et al. | 2016 | Essential for transformers |

---

## Chapter 5: ML Frameworks

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| TensorFlow | Abadi et al. | 2016 | Static graph execution model |
| PyTorch | Paszke et al. | 2019 | Dynamic graph, define-by-run |
| JAX/Autograd | Frostig et al. / Bradbury et al. | 2018 | Functional transformations |
| Theano | Bergstra et al. | 2010 | First symbolic computation + autodiff |
| Automatic Differentiation Survey | Baydin et al. | 2018 | Definitive autodiff reference |
| cuDNN | Chetlur et al. | 2014 | GPU primitives foundation |
| BLAS | Lawson et al. | 1979 | Linear algebra interface standard |
| Training with Sublinear Memory | Chen et al. | 2016 | Gradient checkpointing |

---

## Chapter 6: Model Training

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Learning Representations by Back-Propagating Errors | Rumelhart et al. | 1986 | Core training algorithm |
| Mixed Precision Training | Micikevicius et al. | 2017 | FP16/FP32 training |
| Training with Sublinear Memory | Chen et al. | 2016 | Gradient checkpointing |
| FlashAttention | Dao et al. | 2022 | IO-aware attention, O(n) memory |
| Accurate, Large Minibatch SGD | Goyal et al. | 2017 | Linear scaling rule for large batches |
| Large Scale Distributed Deep Networks | Dean et al. | 2012 | Parameter server architecture |
| Horovod | Sergeev & Del Balso | 2018 | Ring AllReduce for distributed training |
| SGDR: Warm Restarts | Loshchilov & Hutter | 2016 | Cosine annealing schedule |

---

## Chapter 7: Hardware Acceleration

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Scalable Parallel Programming with CUDA | Nickolls et al. | 2008 | GPU computing model |
| cuDNN | Chetlur et al. | 2014 | GPU deep learning primitives |
| In-Datacenter TPU Analysis | Jouppi et al. | 2017 | TPU architecture |
| Ten Lessons from Three TPU Generations | Jouppi et al. | 2021 | TPU evolution |
| Systolic Arrays for VLSI | Kung & Leiserson | 1979 | Systolic array concept |
| Why Systolic Architectures? | Kung | 1982 | Systolic design principles |
| Eyeriss | Chen et al. | 2016 | Dataflow taxonomy (weight/output/row stationary) |
| TVM | Chen et al. | 2018 | ML compiler with auto-tuning |
| MLIR | Lattner et al. | 2020 | Multi-level IR for ML |
| Roofline Model | Williams et al. | 2009 | Compute vs memory-bound analysis |
| Efficient Processing of DNNs Survey | Sze et al. | 2017 | Comprehensive accelerator survey |

---

## Chapter 8: Model Compression

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Quantization and Training for Efficient Inference | Jacob et al. | 2018 | Standard INT8 quantization |
| Deep Compression | Han et al. | 2015 | Pruning + quantization pipeline |
| Optimal Brain Damage | LeCun et al. | 1989 | First pruning formalization |
| Pruning Filters for Efficient ConvNets | Li et al. | 2017 | Structured pruning |
| Distilling Knowledge in a Neural Network | Hinton et al. | 2015 | Knowledge distillation |
| Neural Architecture Search with RL | Zoph & Le | 2017 | Automated architecture discovery |
| DARTS | Liu et al. | 2019 | Differentiable NAS |
| MobileNets | Howard et al. | 2017 | Depthwise separable convolutions |
| EfficientNet | Tan & Le | 2019 | Compound scaling |
| Lottery Ticket Hypothesis | Frankle & Carbin | 2019 | Sparse trainable subnetworks |

---

## Chapter 9: Benchmarking

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| MLPerf Training Benchmark | Mattson et al. | 2020 | Industry standard training benchmark |
| MLPerf Inference Benchmark | Reddi et al. | 2020 | Standardized inference evaluation |
| MLPerf Tiny Benchmark | Banbury et al. | 2021 | Microcontroller ML benchmark |
| DAWNBench | Coleman et al. | 2017 | Time-to-accuracy evaluation |
| ImageNet | Deng et al. | 2009 | Standard vision benchmark |
| COCO | Lin et al. | 2014 | Detection/segmentation benchmark |
| SQuAD | Rajpurkar et al. | 2016 | Reading comprehension benchmark |
| GLUE | Wang et al. | 2018 | Multi-task NLP benchmark |

---

## Chapter 10: Model Serving

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| TensorFlow Serving | Olston et al. | 2017 | Dynamic batching, model serving architecture |
| Clipper | Crankshaw et al. | 2017 | Low-latency prediction serving |
| The Tail at Scale | Dean & Barroso | 2013 | Tail latency in distributed systems |
| Orca | Yu et al. | 2022 | Continuous batching for LLMs |
| vLLM (PagedAttention) | Kwon et al. | 2023 | KV cache memory management |
| FlashAttention | Dao et al. | 2022 | Efficient attention for inference |
| Nexus | Shen et al. | 2019 | GPU cluster for DNN serving |
| Little's Law | Little | 1961 | Queuing theory foundation |

---

## Chapter 11: Data Engineering

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Data Cascades in High-Stakes AI | Sambasivan et al. | 2021 | Data quality as engineering concern |
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | Training-serving skew |
| Datasheets for Datasets | Gebru et al. | 2021 | Dataset documentation standard |
| Survey on Concept Drift Adaptation | Gama et al. | 2014 | Drift detection taxonomy |
| Cheap and Fast—But is it Good? | Snow et al. | 2008 | Crowdsourcing quality |

---

## Chapter 12: Data Efficiency

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Scaling Laws for Neural Language Models | Kaplan et al. | 2020 | Power-law scaling relationships |
| Training Compute-Optimal LLMs (Chinchilla) | Hoffmann et al. | 2022 | Optimal data-to-parameter ratios |
| Curriculum Learning | Bengio et al. | 2009 | Easy-to-hard training order |
| Active Learning Literature Survey | Settles | 2009 | Survey of query strategies |
| FixMatch | Sohn et al. | 2020 | Semi-supervised learning |
| SimCLR | Chen et al. | 2020 | Contrastive self-supervised learning |
| MoCo | He et al. | 2020 | Momentum contrastive learning |
| mixup | Zhang et al. | 2018 | Data augmentation |

---

## Chapter 13: ML Operations

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | ML technical debt framework |
| Software Engineering for ML | Amershi et al. | 2019 | ML-specific SE practices |
| ML Test Score | Breck et al. | 2017 | Production readiness rubric |
| TFX | Baylor et al. | 2017 | End-to-end ML platform |
| MLflow | Zaharia et al. | 2018 | Experiment tracking standard |

---

## Chapter 14: Responsible Engineering

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Model Cards for Model Reporting | Mitchell et al. | 2019 | Model documentation standard |
| Datasheets for Datasets | Gebru et al. | 2021 | Dataset documentation |
| "Why Should I Trust You?" (LIME) | Ribeiro et al. | 2016 | Model-agnostic explanations |
| SHAP | Lundberg & Lee | 2017 | Game-theoretic feature attribution |
| Gender Shades | Buolamwini & Gebru | 2018 | Bias audit methodology |
| Equality of Opportunity | Hardt et al. | 2016 | Fairness definitions |
| Inherent Trade-Offs in Fair Risk Scores | Kleinberg et al. | 2016 | Fairness impossibility results |
| Big Data's Disparate Impact | Barocas & Selbst | 2016 | Legal framework for algorithmic discrimination |

---

## Chapter 15: ML Workflow

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| From Data Mining to KDD | Fayyad et al. | 1996 | KDD process methodology |
| CRISP-DM | Chapman et al. | 2000 | Industry-standard data mining process, template for ML workflows |
| Software Engineering for ML | Amershi et al. | 2019 | ML lifecycle principles |

---

## Summary Statistics

| Chapter | Seminal Papers Listed |
|---------|----------------------|
| Introduction | 10 |
| ML Systems | 8 |
| Neural Computation | 9 |
| Network Architectures | 14 |
| ML Frameworks | 8 |
| Model Training | 8 |
| Hardware Acceleration | 11 |
| Model Compression | 10 |
| Benchmarking | 8 |
| Model Serving | 8 |
| Data Engineering | 5 |
| Data Efficiency | 8 |
| ML Operations | 5 |
| Responsible Engineering | 8 |
| ML Workflow | 3 |
| **TOTAL** | **123 listings, 109 unique papers** |
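
The unique total can be recomputed from the tables by summing the per-chapter counts and subtracting repeat listings. A minimal sketch, assuming each table row counts as one paper (so the JAX/Autograd row is a single entry) and that the titles below are the only ones listed in more than one chapter; the variable names are illustrative:

```python
# Per-chapter listing counts, copied from the summary table above.
listings_per_chapter = {
    "Introduction": 10,
    "ML Systems": 8,
    "Neural Computation": 9,
    "Network Architectures": 14,
    "ML Frameworks": 8,
    "Model Training": 8,
    "Hardware Acceleration": 11,
    "Model Compression": 10,
    "Benchmarking": 8,
    "Model Serving": 8,
    "Data Engineering": 5,
    "Data Efficiency": 8,
    "ML Operations": 5,
    "Responsible Engineering": 8,
    "ML Workflow": 3,
}

# Papers that appear in more than one chapter table, mapped to the
# number of tables that list them (assumption: this list is complete).
repeated = {
    "Learning Representations by Back-Propagating Errors": 3,
    "ImageNet Classification with Deep CNNs (AlexNet)": 2,
    "Hidden Technical Debt in ML Systems": 3,
    "In-Datacenter Performance Analysis of a TPU": 2,
    "MobileNets": 2,
    "MLPerf Tiny Benchmark": 2,
    "Roofline Model": 2,
    "cuDNN": 2,
    "Training with Sublinear Memory": 2,
    "FlashAttention": 2,
    "Datasheets for Datasets": 2,
    "Software Engineering for ML": 2,
}

total_listings = sum(listings_per_chapter.values())
# Each repeated paper contributes (occurrences - 1) duplicate listings.
duplicate_listings = sum(count - 1 for count in repeated.values())
unique_papers = total_listings - duplicate_listings

print(f"listings={total_listings}, unique={unique_papers}")
```

Under these assumptions the sketch gives 123 listings and 109 distinct papers; rerun it whenever a row is added to or removed from a chapter table.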

---

## Next Steps

1. Cross-check each chapter against this corpus
2. Add missing citations where topics are discussed
3. Remove any citations that aren't justified by this list (clutter)

---

*This corpus represents the foundational literature for ML systems. Each paper was selected because it introduced a concept, technique, or result that shaped the field.*