# Volume 1 Seminal Papers Corpus

This document defines the **core corpus of papers** that should be cited in each chapter, with justification for why each is seminal.

Generated: January 29, 2026

---

## How to Use This Document

For each chapter:

1. Check whether each paper listed for it is already cited
2. If it is not cited but the chapter discusses its topic → ADD the citation
3. If the chapter does not discuss the topic → SKIP (don't force citations)
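
The three rules above amount to a small decision procedure. A minimal sketch in Python, assuming a reviewer supplies two yes/no judgments per paper by hand; the names `Action` and `citation_action` are hypothetical and not part of the book's tooling:

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes when checking one corpus paper against one chapter."""
    KEEP = "already cited, nothing to do"
    ADD = "add the citation"
    SKIP = "do not force a citation"


def citation_action(already_cited: bool, topic_discussed: bool) -> Action:
    """Apply rules 1-3 in order (hypothetical helper for illustration)."""
    if already_cited:        # rule 1: the paper is already cited
        return Action.KEEP
    if topic_discussed:      # rule 2: not cited, but the topic is covered
        return Action.ADD
    return Action.SKIP       # rule 3: topic not covered in this chapter


# Example: a chapter that discusses the topic but lacks the citation.
print(citation_action(already_cited=False, topic_discussed=True).name)
```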

---

## Chapter 1: Introduction

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Computing Machinery and Intelligence | Turing | 1950 | Introduced Turing Test, framed machine intelligence |
| A Proposal for the Dartmouth Summer Research Project | McCarthy et al. | 1955 | Coined "artificial intelligence", launched AI as field |
| The Perceptron | Rosenblatt | 1957 | First learning algorithm that adjusts weights from data |
| Perceptrons: An Introduction to Computational Geometry | Minsky & Papert | 1969 | Proved perceptron limitations, contributed to first AI winter |
| Learning Representations by Back-Propagating Errors | Rumelhart, Hinton, Williams | 1986 | Popularized backpropagation, enabled deep learning |
| ImageNet Classification with Deep CNNs (AlexNet) | Krizhevsky et al. | 2012 | Sparked deep learning revolution |
| Software 2.0 | Karpathy | 2017 | Framed shift from code to learned models |
| The Bitter Lesson | Sutton | 2019 | Argued that scaling computation beats encoded expertise |
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | Established ML systems engineering as discipline |
| AI and Compute | Amodei & Hernandez | 2018 | Quantified exponential growth in AI compute |

---

## Chapter 2: ML Systems

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| In-Datacenter Performance Analysis of a TPU | Jouppi et al. | 2017 | First TPU disclosure, established domain-specific accelerators |
| Hitting the Memory Wall | Wulf & McKee | 1995 | Coined "memory wall", identified fundamental bottleneck |
| MobileNets | Howard et al. | 2017 | Enabled efficient mobile deployment |
| Communication-Efficient Learning (FedAvg) | McMahan et al. | 2017 | Established federated learning |
| Widening Access to Applied ML with TinyML | Reddi et al. | 2022 | Democratized ML on resource-constrained devices |
| MLPerf Tiny Benchmark | Banbury, Reddi et al. | 2021 | First benchmark for microcontroller ML |
| Deep Learning Recommendation Model (DLRM) | Naumov et al. | 2019 | Industry-standard recommendation architecture |
| Roofline Model | Williams et al. | 2009 | Framework for compute vs memory-bound analysis |

---

## Chapter 3: Neural Computation

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Learning Representations by Back-Propagating Errors | Rumelhart et al. | 1986 | Standard training algorithm |
| Rectified Linear Units Improve RBMs | Nair & Hinton | 2010 | Established ReLU as default activation |
| Adam: A Method for Stochastic Optimization | Kingma & Ba | 2014 | Default optimizer for most applications |
| Dropout: Preventing Overfitting | Srivastava et al. | 2014 | Standard regularization technique |
| Batch Normalization | Ioffe & Szegedy | 2015 | Enables faster, stable training |
| Understanding Difficulty of Training Deep Networks | Glorot & Bengio | 2010 | Xavier/Glorot initialization |
| Deep Learning (Nature) | LeCun, Bengio, Hinton | 2015 | Landmark review marking mainstream acceptance |
| Approximation by Superpositions of a Sigmoidal Function | Cybenko | 1989 | Universal approximation theorem |
| Delving Deep into Rectifiers (He Init) | He et al. | 2015 | Initialization for ReLU networks |

---

## Chapter 4: Network Architectures

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Gradient-based Learning (LeNet) | LeCun et al. | 1998 | First successful CNN |
| ImageNet Classification (AlexNet) | Krizhevsky et al. | 2012 | Deep learning breakthrough |
| Very Deep CNNs (VGGNet) | Simonyan & Zisserman | 2014 | Showed depth improves performance |
| Going Deeper with Convolutions (GoogLeNet) | Szegedy et al. | 2015 | Multi-scale inception modules |
| Deep Residual Learning (ResNet) | He et al. | 2016 | Skip connections enabled 100+ layers |
| Densely Connected CNNs (DenseNet) | Huang et al. | 2017 | Feature reuse through dense connectivity |
| Long Short-Term Memory | Hochreiter & Schmidhuber | 1997 | Gating for long-term dependencies |
| GRU | Cho et al. | 2014 | Simpler alternative to LSTM |
| Neural Machine Translation (Attention) | Bahdanau et al. | 2014 | Introduced attention mechanism |
| Attention Is All You Need | Vaswani et al. | 2017 | Transformer architecture |
| BERT | Devlin et al. | 2019 | Bidirectional pre-training paradigm |
| GPT | Radford et al. | 2018 | Autoregressive pre-training |
| Vision Transformer (ViT) | Dosovitskiy et al. | 2021 | Transformers for vision |
| Layer Normalization | Ba et al. | 2016 | Essential for transformers |

---

## Chapter 5: ML Frameworks

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| TensorFlow | Abadi et al. | 2016 | Static graph execution model |
| PyTorch | Paszke et al. | 2019 | Dynamic graph, define-by-run |
| JAX/Autograd | Frostig et al. / Bradbury et al. | 2018 | Functional transformations |
| Theano | Bergstra et al. | 2010 | First symbolic computation + autodiff |
| Automatic Differentiation Survey | Baydin et al. | 2018 | Definitive autodiff reference |
| cuDNN | Chetlur et al. | 2014 | GPU primitives foundation |
| BLAS | Lawson et al. | 1979 | Linear algebra interface standard |
| Training with Sublinear Memory | Chen et al. | 2016 | Gradient checkpointing |

---

## Chapter 6: Model Training

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Learning Representations by Back-Propagating Errors | Rumelhart et al. | 1986 | Core training algorithm |
| Mixed Precision Training | Micikevicius et al. | 2017 | FP16/FP32 training |
| Training with Sublinear Memory | Chen et al. | 2016 | Gradient checkpointing |
| FlashAttention | Dao et al. | 2022 | IO-aware attention, O(n) memory |
| Accurate, Large Minibatch SGD | Goyal et al. | 2017 | Linear scaling rule for large batches |
| Large Scale Distributed Deep Networks | Dean et al. | 2012 | Parameter server architecture |
| Horovod | Sergeev & Del Balso | 2018 | Ring AllReduce for distributed training |
| SGDR: Warm Restarts | Loshchilov & Hutter | 2016 | Cosine annealing schedule |

---

## Chapter 7: Hardware Acceleration

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Scalable Parallel Programming with CUDA | Nickolls et al. | 2008 | GPU computing model |
| cuDNN | Chetlur et al. | 2014 | GPU deep learning primitives |
| In-Datacenter TPU Analysis | Jouppi et al. | 2017 | TPU architecture |
| Ten Lessons from Three TPU Generations | Jouppi et al. | 2021 | TPU evolution |
| Systolic Arrays for VLSI | Kung & Leiserson | 1979 | Systolic array concept |
| Why Systolic Architectures? | Kung | 1982 | Systolic design principles |
| Eyeriss | Chen et al. | 2016 | Dataflow taxonomy (weight/output/input stationary) |
| TVM | Chen et al. | 2018 | ML compiler with auto-tuning |
| MLIR | Lattner et al. | 2019 | Multi-level IR for ML |
| Roofline Model | Williams et al. | 2009 | Compute vs memory-bound analysis |
| Efficient Processing of DNNs Survey | Sze et al. | 2017 | Comprehensive accelerator survey |

---

## Chapter 8: Model Compression

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Quantization and Training for Efficient Inference | Jacob et al. | 2018 | Standard INT8 quantization |
| Deep Compression | Han et al. | 2015 | Pruning + quantization pipeline |
| Optimal Brain Damage | LeCun et al. | 1989 | First pruning formalization |
| Pruning Filters for Efficient ConvNets | Li et al. | 2017 | Structured pruning |
| Distilling Knowledge in a Neural Network | Hinton et al. | 2015 | Knowledge distillation |
| Neural Architecture Search with RL | Zoph & Le | 2017 | Automated architecture discovery |
| DARTS | Liu et al. | 2019 | Differentiable NAS |
| MobileNets | Howard et al. | 2017 | Depthwise separable convolutions |
| EfficientNet | Tan & Le | 2019 | Compound scaling |
| Lottery Ticket Hypothesis | Frankle & Carbin | 2019 | Sparse trainable subnetworks |

---

## Chapter 9: Benchmarking

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| MLPerf Training Benchmark | Mattson et al. | 2020 | Industry standard training benchmark |
| MLPerf Inference Benchmark | Reddi et al. | 2020 | Standardized inference evaluation |
| MLPerf Tiny Benchmark | Banbury et al. | 2021 | Microcontroller ML benchmark |
| DAWNBench | Coleman et al. | 2017 | Time-to-accuracy evaluation |
| ImageNet | Deng et al. | 2009 | Standard vision benchmark |
| COCO | Lin et al. | 2014 | Detection/segmentation benchmark |
| SQuAD | Rajpurkar et al. | 2016 | Reading comprehension benchmark |
| GLUE | Wang et al. | 2018 | Multi-task NLP benchmark |

---

## Chapter 10: Model Serving

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| TensorFlow Serving | Olston et al. | 2017 | Dynamic batching, model serving architecture |
| Clipper | Crankshaw et al. | 2017 | Low-latency prediction serving |
| The Tail at Scale | Dean & Barroso | 2013 | Tail latency in distributed systems |
| Orca | Yu et al. | 2022 | Continuous batching for LLMs |
| vLLM (PagedAttention) | Kwon et al. | 2023 | KV cache memory management |
| FlashAttention | Dao et al. | 2022 | Efficient attention for inference |
| Nexus | Shen et al. | 2019 | GPU cluster for DNN serving |
| Little's Law | Little | 1961 | Queuing theory foundation |

---

## Chapter 11: Data Engineering

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Data Cascades in High-Stakes AI | Sambasivan et al. | 2021 | Data quality as engineering concern |
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | Training-serving skew |
| Datasheets for Datasets | Gebru et al. | 2021 | Dataset documentation standard |
| Survey on Concept Drift Adaptation | Gama et al. | 2014 | Drift detection taxonomy |
| Cheap and Fast—But is it Good? | Snow et al. | 2008 | Crowdsourcing quality |

---

## Chapter 12: Data Efficiency

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Scaling Laws for Neural Language Models | Kaplan et al. | 2020 | Power-law scaling relationships |
| Training Compute-Optimal LLMs (Chinchilla) | Hoffmann et al. | 2022 | Optimal data-to-parameter ratios |
| Curriculum Learning | Bengio et al. | 2009 | Easy-to-hard training order |
| Active Learning Literature Survey | Settles | 2009 | Survey of query strategies |
| FixMatch | Sohn et al. | 2020 | Semi-supervised learning |
| SimCLR | Chen et al. | 2020 | Contrastive self-supervised learning |
| MoCo | He et al. | 2020 | Momentum contrastive learning |
| mixup | Zhang et al. | 2018 | Data augmentation |

---

## Chapter 13: ML Operations

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Hidden Technical Debt in ML Systems | Sculley et al. | 2015 | ML technical debt framework |
| Software Engineering for ML | Amershi et al. | 2019 | ML-specific SE practices |
| ML Test Score | Breck et al. | 2017 | Production readiness rubric |
| TFX | Baylor et al. | 2017 | End-to-end ML platform |
| MLflow | Zaharia et al. | 2018 | Experiment tracking standard |

---

## Chapter 14: Responsible Engineering

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| Model Cards for Model Reporting | Mitchell et al. | 2019 | Model documentation standard |
| Datasheets for Datasets | Gebru et al. | 2021 | Dataset documentation |
| "Why Should I Trust You?" (LIME) | Ribeiro et al. | 2016 | Model-agnostic explanations |
| SHAP | Lundberg & Lee | 2017 | Game-theoretic feature attribution |
| Gender Shades | Buolamwini & Gebru | 2018 | Bias audit methodology |
| Equality of Opportunity | Hardt et al. | 2016 | Fairness definitions |
| Inherent Trade-Offs in Fair Risk Scores | Kleinberg et al. | 2016 | Fairness impossibility results |
| Big Data's Disparate Impact | Barocas & Selbst | 2016 | Legal framework for algorithmic discrimination |

---

## Chapter 15: ML Workflow

| Paper | Authors | Year | Why Seminal |
|-------|---------|------|-------------|
| From Data Mining to KDD | Fayyad et al. | 1996 | KDD process methodology |
| CRISP-DM | Chapman et al. | 2000 | Industry-standard ML workflow |
| Software Engineering for ML | Amershi et al. | 2019 | ML lifecycle principles |

---

## Summary Statistics

| Chapter | Seminal Papers Listed |
|---------|----------------------|
| Introduction | 10 |
| ML Systems | 8 |
| Neural Computation | 9 |
| Network Architectures | 14 |
| ML Frameworks | 8 |
| Model Training | 8 |
| Hardware Acceleration | 11 |
| Model Compression | 10 |
| Benchmarking | 8 |
| Model Serving | 8 |
| Data Engineering | 5 |
| Data Efficiency | 8 |
| ML Operations | 5 |
| Responsible Engineering | 8 |
| ML Workflow | 3 |
| **TOTAL** | **123 listed (109 unique papers)** |
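
The totals can be recomputed from the chapter tables rather than typed by hand. The sketch below is one way to do that in Python, under two assumptions: each table row counts as one entry (so the combined JAX/Autograd row counts once), and papers repeated across chapters are matched by title by hand. The variable names are illustrative only.

```python
# Rows per chapter, copied from the summary table.
listed_per_chapter = {
    "Introduction": 10,
    "ML Systems": 8,
    "Neural Computation": 9,
    "Network Architectures": 14,
    "ML Frameworks": 8,
    "Model Training": 8,
    "Hardware Acceleration": 11,
    "Model Compression": 10,
    "Benchmarking": 8,
    "Model Serving": 8,
    "Data Engineering": 5,
    "Data Efficiency": 8,
    "ML Operations": 5,
    "Responsible Engineering": 8,
    "ML Workflow": 3,
}

# Papers that appear in more than one chapter table, mapped to the
# number of tables they appear in (assumption: matched by title by hand).
repeated = {
    "Learning Representations by Back-Propagating Errors": 3,
    "Hidden Technical Debt in ML Systems": 3,
    "ImageNet Classification (AlexNet)": 2,
    "In-Datacenter Performance Analysis of a TPU": 2,
    "MobileNets": 2,
    "MLPerf Tiny Benchmark": 2,
    "Roofline Model": 2,
    "cuDNN": 2,
    "Training with Sublinear Memory": 2,
    "FlashAttention": 2,
    "Datasheets for Datasets": 2,
    "Software Engineering for ML": 2,
}

total_listed = sum(listed_per_chapter.values())
# Every appearance beyond the first is a duplicate row.
duplicate_rows = sum(count - 1 for count in repeated.values())
unique_papers = total_listed - duplicate_rows

print(f"{total_listed} listed, {duplicate_rows} repeats, {unique_papers} unique")
```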

---

## Next Steps

1. Cross-check each chapter against this corpus
2. Add missing citations where topics are discussed
3. Remove any citations that aren't justified by this list (clutter)

---

*This corpus represents the foundational literature for ML systems. Each paper was selected because it introduced a concept, technique, or result that shaped the field.*