# Volume 1 Seminal Papers Corpus

This document defines the core corpus of papers that should be cited in each chapter, with justification for why each is seminal.

Generated: January 29, 2026


## How to Use This Document

For each chapter (a sketch of scripting this check follows the list):

1. Check whether the paper is already cited
2. If not cited but the topic is discussed → ADD the citation
3. If the topic is not discussed → SKIP (don't force citations)
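
Step 1 is mechanical enough to automate. The sketch below is one way such a cross-check might look; the directory layout (`contents/<chapter>/<chapter>.qmd`) and the citation keys in `CORPUS` are illustrative assumptions, not the book's actual paths or keys.

```python
# Hypothetical cross-check: list corpus papers whose citation keys do not
# appear in a chapter's Quarto source. Paths and keys are assumptions.
import re
from pathlib import Path

CORPUS = {
    "introduction": ["turing1950", "mccarthy1955proposal", "rosenblatt1957"],
    "ml_systems": ["jouppi2017tpu", "wulf1995memorywall"],
}

def cited_keys(qmd_path: Path) -> set[str]:
    """Collect every Pandoc-style @key citation in a chapter file."""
    text = qmd_path.read_text(encoding="utf-8")
    return set(re.findall(r"@([A-Za-z][\w:.-]*)", text))

for chapter, expected in CORPUS.items():
    path = Path("contents") / chapter / f"{chapter}.qmd"
    if not path.exists():  # skip chapters not present in this checkout
        continue
    missing = sorted(set(expected) - cited_keys(path))
    if missing:
        print(f"{chapter}: candidate additions -> {missing}")
```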

## Chapter 1: Introduction

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Computing Machinery and Intelligence | Turing | 1950 | Introduced the Turing Test, framed machine intelligence |
| A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence | McCarthy et al. | 1955 | Coined "artificial intelligence", launched AI as a field |
| The Perceptron | Rosenblatt | 1957 | First learning algorithm that adjusts weights from data |
| Perceptrons: An Introduction to Computational Geometry | Minsky & Papert | 1969 | Proved perceptron limitations, contributed to the first AI winter |
| Learning Representations by Back-Propagating Errors | Rumelhart, Hinton, Williams | 1986 | Popularized backpropagation, enabled deep learning |
| ImageNet Classification with Deep CNNs (AlexNet) | Krizhevsky et al. | 2012 | Sparked the deep learning revolution |
| Software 2.0 | Karpathy | 2017 | Framed the shift from hand-written code to learned models |
| The Bitter Lesson | Sutton | 2019 | Argued that computation beats encoded expertise |
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | Established ML systems engineering as a discipline |
| AI and Compute | Amodei & Hernandez | 2018 | Quantified exponential growth in AI compute |

## Chapter 2: ML Systems

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| In-Datacenter Performance Analysis of a TPU | Jouppi et al. | 2017 | First TPU disclosure, established domain-specific accelerators |
| Hitting the Memory Wall | Wulf & McKee | 1995 | Coined "memory wall", identified fundamental bottleneck |
| MobileNets | Howard et al. | 2017 | Enabled efficient mobile deployment |
| Communication-Efficient Learning (FedAvg) | McMahan et al. | 2017 | Established federated learning |
| Widening Access to Applied ML with TinyML | Reddi et al. | 2022 | Democratized ML on resource-constrained devices |
| MLPerf Tiny Benchmark | Banbury, Reddi et al. | 2021 | First benchmark for microcontroller ML |
| Deep Learning Recommendation Model (DLRM) | Naumov et al. | 2019 | Industry-standard recommendation architecture |
| Roofline Model | Williams et al. | 2009 | Framework for compute vs memory-bound analysis |
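
The Roofline row compresses the whole model into a phrase, so a worked example may help: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine balance (peak FLOP/s divided by peak bytes/s). The hardware numbers below are illustrative placeholders, not any particular chip.

```python
# Roofline classification with made-up hardware numbers.
peak_flops = 100e12   # assumed peak compute: 100 TFLOP/s
peak_bw = 2e12        # assumed peak memory bandwidth: 2 TB/s
machine_balance = peak_flops / peak_bw          # 50 FLOPs per byte

# Example kernel: a 4096x4096 FP16 matrix-vector product (weights read once).
flops = 2 * 4096 * 4096
bytes_moved = 4096 * 4096 * 2

intensity = flops / bytes_moved                 # 1 FLOP per byte
attainable = min(peak_flops, intensity * peak_bw)
print(f"memory-bound: {intensity < machine_balance}, "
      f"attainable: {attainable / 1e12:.1f} TFLOP/s")   # True, 2.0 TFLOP/s
```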

## Chapter 3: Neural Computation

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Learning Representations by Back-Propagating Errors | Rumelhart et al. | 1986 | Standard training algorithm |
| Rectified Linear Units Improve RBMs | Nair & Hinton | 2010 | Established ReLU as default activation |
| Adam: A Method for Stochastic Optimization | Kingma & Ba | 2014 | Default optimizer for most applications |
| Dropout: Preventing Overfitting | Srivastava et al. | 2014 | Standard regularization technique |
| Batch Normalization | Ioffe & Szegedy | 2015 | Enables faster, stable training |
| Understanding Difficulty of Training Deep Networks | Glorot & Bengio | 2010 | Xavier/Glorot initialization |
| Deep Learning (Nature) | LeCun, Bengio, Hinton | 2015 | Landmark review marking mainstream acceptance |
| Approximation by Superpositions of a Sigmoidal Function | Cybenko | 1989 | Universal approximation theorem |
| Delving Deep into Rectifiers (He Init) | He et al. | 2015 | Initialization for ReLU networks |
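
Since Adam is listed as the default optimizer, its update rule is worth having next to the citation. A minimal NumPy sketch, using the default hyperparameters from Kingma & Ba:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter tensor; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```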

## Chapter 4: Network Architectures

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Gradient-based Learning (LeNet) | LeCun et al. | 1998 | First successful CNN |
| ImageNet Classification (AlexNet) | Krizhevsky et al. | 2012 | Deep learning breakthrough |
| Very Deep CNNs (VGGNet) | Simonyan & Zisserman | 2014 | Showed depth improves performance |
| Going Deeper with Convolutions (GoogLeNet) | Szegedy et al. | 2015 | Multi-scale inception modules |
| Deep Residual Learning (ResNet) | He et al. | 2016 | Skip connections enabled 100+ layers |
| Densely Connected CNNs (DenseNet) | Huang et al. | 2017 | Feature reuse through dense connectivity |
| Long Short-Term Memory | Hochreiter & Schmidhuber | 1997 | Gating for long-term dependencies |
| GRU | Cho et al. | 2014 | Simpler alternative to LSTM |
| Neural Machine Translation (Attention) | Bahdanau et al. | 2014 | Introduced attention mechanism |
| Attention Is All You Need | Vaswani et al. | 2017 | Transformer architecture |
| BERT | Devlin et al. | 2019 | Bidirectional pre-training paradigm |
| GPT | Radford et al. | 2018 | Autoregressive pre-training |
| Vision Transformer (ViT) | Dosovitskiy et al. | 2021 | Transformers for vision |
| Layer Normalization | Ba et al. | 2016 | Essential for transformers |
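
Several rows in this table (Bahdanau attention, the Transformer, BERT, GPT, ViT) build on the same core computation, scaled dot-product attention from Vaswani et al.: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal single-head NumPy sketch:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention (no masking or batching)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # convex combination of values
```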

## Chapter 5: ML Frameworks

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| TensorFlow | Abadi et al. | 2016 | Static graph execution model |
| PyTorch | Paszke et al. | 2019 | Dynamic graph, define-by-run |
| JAX/Autograd | Frostig et al. / Bradbury et al. | 2018 | Functional transformations |
| Theano | Bergstra et al. | 2010 | Pioneered symbolic graph computation with autodiff for deep learning |
| Automatic Differentiation Survey | Baydin et al. | 2018 | Definitive autodiff reference |
| cuDNN | Chetlur et al. | 2014 | GPU primitives foundation |
| BLAS | Lawson et al. | 1979 | Linear algebra interface standard |
| Training with Sublinear Memory | Chen et al. | 2016 | Gradient checkpointing |

## Chapter 6: Model Training

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Learning Representations by Back-Propagating Errors | Rumelhart et al. | 1986 | Core training algorithm |
| Mixed Precision Training | Micikevicius et al. | 2017 | FP16/FP32 training |
| Training with Sublinear Memory | Chen et al. | 2016 | Gradient checkpointing |
| FlashAttention | Dao et al. | 2022 | IO-aware attention, O(n) memory |
| Accurate, Large Minibatch SGD | Goyal et al. | 2017 | Linear scaling rule for large batches |
| Large Scale Distributed Deep Networks | Dean et al. | 2012 | Parameter server architecture |
| Horovod | Sergeev & Del Balso | 2018 | Ring AllReduce for distributed training |
| SGDR: Warm Restarts | Loshchilov & Hutter | 2016 | Cosine annealing schedule |
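
The SGDR entry names a schedule compact enough to state inline. A sketch of the cosine annealing curve, shown here without the warm-restart cycling the paper adds on top:

```python
import math

def cosine_lr(t, T, eta_max=1e-3, eta_min=0.0):
    """LR at step t of T: eta_min + (eta_max - eta_min) * (1 + cos(pi*t/T)) / 2."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```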

## Chapter 7: Hardware Acceleration

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Scalable Parallel Programming with CUDA | Nickolls et al. | 2008 | GPU computing model |
| cuDNN | Chetlur et al. | 2014 | GPU deep learning primitives |
| In-Datacenter TPU Analysis | Jouppi et al. | 2017 | TPU architecture |
| Ten Lessons from Three TPU Generations | Jouppi et al. | 2021 | TPU evolution |
| Systolic Arrays for VLSI | Kung & Leiserson | 1979 | Systolic array concept |
| Why Systolic Architectures? | Kung | 1982 | Systolic design principles |
| Eyeriss | Chen et al. | 2016 | Dataflow taxonomy (weight/output/input stationary) |
| TVM | Chen et al. | 2018 | ML compiler with auto-tuning |
| MLIR | Lattner et al. | 2019 | Multi-level IR for ML |
| Roofline Model | Williams et al. | 2009 | Compute vs memory-bound analysis |
| Efficient Processing of DNNs Survey | Sze et al. | 2017 | Comprehensive accelerator survey |

## Chapter 8: Model Compression

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Quantization and Training for Efficient Inference | Jacob et al. | 2018 | Standard INT8 quantization |
| Deep Compression | Han et al. | 2015 | Pruning + quantization pipeline |
| Optimal Brain Damage | LeCun et al. | 1989 | First pruning formalization |
| Pruning Filters for Efficient ConvNets | Li et al. | 2017 | Structured pruning |
| Distilling the Knowledge in a Neural Network | Hinton et al. | 2015 | Knowledge distillation |
| Neural Architecture Search with RL | Zoph & Le | 2017 | Automated architecture discovery |
| DARTS | Liu et al. | 2019 | Differentiable NAS |
| MobileNets | Howard et al. | 2017 | Depthwise separable convolutions |
| EfficientNet | Tan & Le | 2019 | Compound scaling |
| Lottery Ticket Hypothesis | Frankle & Carbin | 2019 | Sparse trainable subnetworks |
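
The Jacob et al. scheme at the top of this table fits in a few lines. A minimal sketch of asymmetric (affine) quantization, assuming unsigned 8-bit, a per-tensor scale, and an input with nonzero spread:

```python
import numpy as np

def quantize(x, qmin=0, qmax=255):
    """Map float tensor x to uint8 via q = round(x / scale) + zero_point."""
    scale = (x.max() - x.min()) / (qmax - qmin)      # real units per integer step
    zero_point = int(round(qmin - x.min() / scale))  # integer representing real 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate reals: x_hat = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)
```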

## Chapter 9: Benchmarking

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| MLPerf Training Benchmark | Mattson et al. | 2020 | Industry standard training benchmark |
| MLPerf Inference Benchmark | Reddi et al. | 2020 | Standardized inference evaluation |
| MLPerf Tiny Benchmark | Banbury et al. | 2021 | Microcontroller ML benchmark |
| DAWNBench | Coleman et al. | 2017 | Time-to-accuracy evaluation |
| ImageNet | Deng et al. | 2009 | Standard vision benchmark |
| COCO | Lin et al. | 2014 | Detection/segmentation benchmark |
| SQuAD | Rajpurkar et al. | 2016 | Reading comprehension benchmark |
| GLUE | Wang et al. | 2018 | Multi-task NLP benchmark |

## Chapter 10: Model Serving

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| TensorFlow Serving | Olston et al. | 2017 | Dynamic batching, model serving architecture |
| Clipper | Crankshaw et al. | 2017 | Low-latency prediction serving |
| The Tail at Scale | Dean & Barroso | 2013 | Tail latency in distributed systems |
| Orca | Yu et al. | 2022 | Continuous batching for LLMs |
| vLLM (PagedAttention) | Kwon et al. | 2023 | KV cache memory management |
| FlashAttention | Dao et al. | 2022 | Efficient attention for inference |
| Nexus | Shen et al. | 2019 | GPU cluster for DNN serving |
| Little's Law | Little | 1961 | Queuing theory foundation |
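
Little's Law is the one entry here that is a formula rather than a system: L = lambda * W, the average number of in-flight requests equals arrival rate times mean latency. A worked example with illustrative traffic numbers:

```python
arrival_rate = 200    # requests per second (illustrative)
mean_latency = 0.05   # seconds per request (illustrative)
in_flight = arrival_rate * mean_latency   # L = lambda * W
print(f"~{in_flight:.0f} requests in flight on average")  # ~10
```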

## Chapter 11: Data Engineering

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Data Cascades in High-Stakes AI | Sambasivan et al. | 2021 | Data quality as engineering concern |
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | Training-serving skew |
| Datasheets for Datasets | Gebru et al. | 2021 | Dataset documentation standard |
| Survey on Concept Drift Adaptation | Gama et al. | 2014 | Drift detection taxonomy |
| Cheap and Fast—But is it Good? | Snow et al. | 2008 | Crowdsourcing quality |

## Chapter 12: Data Efficiency

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Scaling Laws for Neural Language Models | Kaplan et al. | 2020 | Power-law scaling relationships |
| Training Compute-Optimal LLMs (Chinchilla) | Hoffmann et al. | 2022 | Optimal data-to-parameter ratios |
| Curriculum Learning | Bengio et al. | 2009 | Easy-to-hard training order |
| Active Learning Literature Survey | Settles | 2009 | Definitive survey of query strategies |
| FixMatch | Sohn et al. | 2020 | Semi-supervised learning |
| SimCLR | Chen et al. | 2020 | Contrastive self-supervised learning |
| MoCo | He et al. | 2020 | Momentum contrastive learning |
| mixup | Zhang et al. | 2018 | Data augmentation |
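
mixup is simple enough to state exactly: train on lam * x_i + (1 - lam) * x_j with labels mixed the same way and lam drawn from Beta(alpha, alpha). A minimal sketch, assuming one-hot label vectors:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Return a convex combination of two (input, one-hot label) pairs."""
    lam = np.random.beta(alpha, alpha)   # lam ~ Beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```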

## Chapter 13: ML Operations

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | ML technical debt framework |
| Software Engineering for ML | Amershi et al. | 2019 | ML-specific SE practices |
| ML Test Score | Breck et al. | 2017 | Production readiness rubric |
| TFX | Baylor et al. | 2017 | End-to-end ML platform |
| MLflow | Zaharia et al. | 2018 | Experiment tracking standard |

## Chapter 14: Responsible Engineering

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| Model Cards for Model Reporting | Mitchell et al. | 2019 | Model documentation standard |
| Datasheets for Datasets | Gebru et al. | 2021 | Dataset documentation |
| "Why Should I Trust You?" (LIME) | Ribeiro et al. | 2016 | Model-agnostic explanations |
| SHAP | Lundberg & Lee | 2017 | Game-theoretic feature attribution |
| Gender Shades | Buolamwini & Gebru | 2018 | Bias audit methodology |
| Equality of Opportunity | Hardt et al. | 2016 | Fairness definitions |
| Inherent Trade-Offs in Fair Risk Scores | Kleinberg et al. | 2016 | Fairness impossibility results |
| Big Data's Disparate Impact | Barocas & Selbst | 2016 | Legal framework for algorithmic discrimination |

## Chapter 15: ML Workflow

| Paper | Authors | Year | Why Seminal |
|---|---|---|---|
| From Data Mining to KDD | Fayyad et al. | 1996 | KDD process methodology |
| CRISP-DM | Chapman et al. | 2000 | Industry-standard ML workflow |
| Software Engineering for ML | Amershi et al. | 2019 | ML lifecycle principles |

## Summary Statistics

| Chapter | Seminal Papers Listed |
|---|---|
| Introduction | 10 |
| ML Systems | 8 |
| Neural Computation | 9 |
| Network Architectures | 14 |
| ML Frameworks | 8 |
| Model Training | 8 |
| Hardware Acceleration | 11 |
| Model Compression | 10 |
| Benchmarking | 8 |
| Model Serving | 8 |
| Data Engineering | 5 |
| Data Efficiency | 8 |
| ML Operations | 5 |
| Responsible Engr | 8 |
| ML Workflow | 3 |
| TOTAL | 123 listed (~110 unique after cross-chapter duplicates) |

## Next Steps

1. Cross-check each chapter against this corpus
2. Add missing citations where topics are discussed
3. Remove citations that aren't justified by this list (they add clutter)

This corpus represents the foundational literature for ML systems. Each paper was selected because it introduced a concept, technique, or result that shaped the field.