docs: remove stale volume planning documents

Remove outdated planning documents that have been superseded by
VOLUME_STRUCTURE.md, which now serves as the authoritative reference
for the two-volume textbook organization.

Removed files:
- VOLUME_SPLIT_ROADMAP.md
- VOLUME_SPLIT_SURGICAL_PLAN.md
- VOLUME_STRUCTURE_PROPOSAL.md
- reviewer-feedback-synthesis-r1.md
- volume-outline-draft.md
Author: Vijay Janapa Reddi
Date: 2026-01-03 09:57:16 -05:00
Parent: f3a38e32d5
Commit: ddb8068e6b
5 changed files with 0 additions and 2425 deletions


@@ -1,394 +0,0 @@
# Machine Learning Systems: Two-Volume Split - Master Roadmap
**Project Start Date**: December 2024
**Target Completion**: June 2025 (6 months)
**Goal**: Create flagship two-volume ML Systems textbook series for MIT Press
---
## 📋 Project Documents
### Core Planning Documents
1. **`MIT_PRESS_PROPOSAL.md`** - Official proposal for MIT Press (ready to submit)
2. **`VOLUME_SPLIT_SURGICAL_PLAN.md`** - Section-by-section surgery instructions (50+ pages)
3. **`VOLUME_SPLIT_ROADMAP.md`** - This document: master project plan and progress tracker
4. **`VOLUME_STRUCTURE_PROPOSAL.md`** - Original analysis and proposal
### Supporting Documents
- `VOLUME_SPLIT_ANALYSIS.md` - Deep analysis of pedagogical issues
- `VOLUME_SPLIT_EXECUTIVE_SUMMARY.md` - Quick reference summary
- `DISTRIBUTED_CONTENT_ADDITIONS.md` - Distributed awareness additions for V1
---
## 🎯 Project Vision
**Volume I: Introduction to ML Systems** (~1,150-1,200 pages)
- Complete single-system ML engineering
- Includes distributed awareness (not implementation)
- Target: Undergrads, bootcamps, ML engineers entering the field
**Volume II: Advanced ML Systems** (~1,100-1,150 pages)
- Distributed systems, production scale, responsibility
- Built on timeless principles
- Target: Graduate students, ML infrastructure engineers, senior practitioners
**The One-Liner**:
> "Volume I teaches you to build ML systems that work; Volume II teaches you to build ML systems that scale."
---
## 📊 Current State (December 2024)
### Existing Content
- **Total pages**: 2,172 pages across 22 chapters
- **Status**: Complete draft of comprehensive single-volume book
- **Quality**: Refined through extensive review feedback
### Content Distribution
- **Chapters 1-14**: Form basis of Volume 1 (with surgery)
- **Chapters 15-21**: Move to Volume 2
- **New content needed**: 325-375 pages (8 new chapters for V2)
---
## 🗓️ Six-Month Timeline
### Phase 1: Chapter Surgery (Months 1-2)
**Goal**: Extract distributed content from 7 chapters, create clean V1
#### Month 1: Content Extraction
- **Week 1-2**: Extract distributed content from Chapters 6, 7, 8
- [ ] Chapter 6 (Data Engineering): Extract 40 pages of distributed content
- [ ] Chapter 7 (Frameworks): Extract 50 pages of distributed execution
- [ ] Chapter 8 (Training): Extract 60 pages of distributed training
- **Week 3-4**: Extract from Chapters 10, 11, 12, 13
- [ ] Chapter 10 (Optimizations): Extract 80 pages (NAS, AutoML)
- [ ] Chapter 11 (Hardware): Extract 50 pages (multi-chip)
- [ ] Chapter 12 (Benchmarking): Extract 40 pages (distributed benchmarking)
- [ ] Chapter 13 (MLOps): Extract 30 pages (production scale)
#### Month 2: Bridging and Polish
- **Week 5-6**: Create V1 transitions
- [ ] Add "See Volume 2" callout boxes in V1
- [ ] Write brief distributed awareness sections
- [ ] Ensure V1 chapters remain coherent
- **Week 7-8**: Organize extracted content
- [ ] Create V2 chapter structure
- [ ] Place extracted content in appropriate V2 chapters
- [ ] Identify gaps in extracted content
### Phase 2: New Content Development (Months 3-5)
#### Month 3: Priority 1 Chapters (Essential Infrastructure)
- **Week 9-10**: Memory & Storage
- [ ] V2 Ch1: Memory Hierarchies for ML (45 pages)
- [ ] V2 Ch2: Storage Systems for ML (40 pages)
- **Week 11-12**: Communication & Distributed Training
- [ ] V2 Ch3: Communication & Collective Operations (45 pages)
- [ ] V2 Ch4: Distributed Training Systems (50 pages) - integrate extracted content
#### Month 4: Priority 2 Chapters (Production Requirements)
- **Week 13-14**: Fault Tolerance & Inference
- [ ] V2 Ch5: Fault Tolerance & Resilience (40 pages)
- [ ] V2 Ch6: Inference at Scale (45 pages)
- **Week 15-16**: Integration
- [ ] Integrate all extracted content into new chapters
- [ ] Write chapter introductions and conclusions
- [ ] Create cross-references
#### Month 5: Priority 3 Chapters (Specialized Topics)
- **Week 17-18**: Edge Systems
- [ ] V2 Ch8: Edge Intelligence Systems (50 pages)
- [ ] Integrate extracted edge content from Ch2
- **Week 19-20**: Final new chapters
- [ ] V2 Ch1: Bridge Chapter (30 pages) - From Single to Distributed
- [ ] Update existing V2 chapters (15-20) with introductions
### Phase 3: Integration and Polish (Month 6)
#### Month 6: Final Integration
- **Week 21-22**: Cross-References and Consistency
- [ ] Update all V1→V2 cross-references
- [ ] Update all V2→V1 prerequisite references
- [ ] Ensure consistent notation across volumes
- [ ] Verify all figure references work
- **Week 23**: Narrative Flow
- [ ] Review V1 narrative arc (Foundations → Building → Optimizing → Impact)
- [ ] Review V2 narrative arc (Scale → Production → Responsibility)
- [ ] Polish chapter transitions
- [ ] Write volume prefaces
- **Week 24**: Final Quality Checks
- [ ] Technical accuracy review
- [ ] Page count verification
- [ ] Exercise and quiz consistency
- [ ] Final copyedit pass
- [ ] Prepare camera-ready manuscripts
---
## 📈 Progress Tracking
### Volume 1 Progress (Target: 1,150-1,200 pages)
| Chapter | Current | Target | Surgery Status | Notes |
|---------|---------|--------|----------------|-------|
| 1. Introduction | 90 | 60 | ⬜ Not Started | Compress history section |
| 2. ML Systems | 70 | 70 | ⬜ Not Started | Extract hybrid architectures |
| 3. DL Primer | 110 | 100 | ⬜ Not Started | No surgery (keep as-is) |
| 4. DNN Architectures | 82 | 100 | ⬜ Not Started | No surgery (keep as-is) |
| 5. Workflow | 51 | 40 | ⬜ Not Started | Minor compression |
| 6. Data Engineering | 138 | 80 | ⬜ Not Started | Extract distributed storage |
| 7. Frameworks | 121 | 100 | ⬜ Not Started | Extract distributed execution |
| 8. Training | 157 | 100 | ⬜ Not Started | Extract distributed training |
| 9. Efficient AI | 52 | 60 | ⬜ Not Started | No surgery (keep as-is) |
| 10. Optimizations | 160 | 120 | ⬜ Not Started | Extract NAS, AutoML |
| 11. Hardware | 181 | 90 | ⬜ Not Started | Extract multi-chip |
| 12. Benchmarking | 124 | 80 | ⬜ Not Started | Extract distributed benchmarking |
| 13. MLOps | 126 | 50 | ⬜ Not Started | Extract production scale |
| 14. AI for Good | 84 | 50 | ⬜ Not Started | Minor compression |
| **TOTAL** | **1,546** | **1,100** | **0%** | **V1 baseline complete** |
Progress Key: ⬜ Not Started | 🟨 In Progress | ✅ Complete
### Volume 2 Progress (Target: 1,100-1,150 pages)
| Chapter | Source | Target | Status | Notes |
|---------|--------|--------|--------|-------|
| 1. Bridge: Single to Distributed | NEW | 30 | ⬜ Not Started | Write from scratch |
| 2. Memory Hierarchies | NEW + Ch11 | 45 | ⬜ Not Started | New content + extracts |
| 3. Storage Systems | NEW + Ch6 | 40 | ⬜ Not Started | New content + extracts |
| 4. Communication & Collectives | NEW + Ch8 | 45 | ⬜ Not Started | New content + extracts |
| 5. Distributed Training | Ch8 + Ch10 | 50 | ⬜ Not Started | Consolidate extracts |
| 6. Fault Tolerance | NEW + Ch13 | 40 | ⬜ Not Started | New content + extracts |
| 7. Inference at Scale | NEW + Ch2 + Ch13 | 45 | ⬜ Not Started | New content + extracts |
| 8. Edge Intelligence | NEW + Ch2 | 50 | ⬜ Not Started | New content + extracts |
| 9. On-Device Learning | Ch14 (existing) | 127 | ⬜ Not Started | Move from V1 |
| 10. Privacy Systems | Ch15 (split) | 65 | ⬜ Not Started | Split Privacy/Security |
| 11. Security Systems | Ch15 (split) | 68 | ⬜ Not Started | Split Privacy/Security |
| 12. Robust AI | Ch16 (existing) | 137 | ⬜ Not Started | Move from V1 |
| 13. Responsible AI | Ch17 (existing) | 135 | ⬜ Not Started | Move from V1 |
| 14. Sustainable AI | Ch18 (existing) | 46 | ⬜ Not Started | Move from V1 |
| 15. Frontiers & AGI | Ch19+20 (merge) | 78 | ⬜ Not Started | Merge two chapters |
| **TOTAL** | **Mixed** | **1,001** | **0%** | **Need ~100 more pages** |
### New Content Writing Progress (325-375 pages needed)
| Chapter | Pages | Draft | Review | Final | Notes |
|---------|-------|-------|--------|-------|-------|
| Bridge Chapter | 30 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Memory Hierarchies | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Storage Systems | 40 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Communication | 45 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Distributed Training | 50 | ⬜ | ⬜ | ⬜ | Priority 1 (+ extracts) |
| Fault Tolerance | 40 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Inference at Scale | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Edge Intelligence | 50 | ⬜ | ⬜ | ⬜ | Priority 3 |
| **TOTAL NEW** | **345** | **0%** | **0%** | **0%** | 8 new chapters |
---
## 📝 Weekly Progress Log
### Week 1 (Dec 2-8, 2024)
- [ ] Created comprehensive surgical plan
- [ ] Created MIT Press proposal
- [ ] Created master roadmap
- [ ] **Next**: Begin Chapter 6 surgery
### Week 2 (Dec 9-15, 2024)
- [ ] Progress notes...
### Week 3 (Dec 16-22, 2024)
- [ ] Progress notes...
*(Continue weekly log throughout 6-month period)*
---
## 🎯 Key Milestones
- [ ] **End Month 1**: All distributed content extracted from V1 chapters
- [ ] **End Month 2**: V1 chapters coherent and complete
- [ ] **End Month 3**: Priority 1 new chapters drafted (160 pages)
- [ ] **End Month 4**: Priority 2 new chapters drafted (85 pages)
- [ ] **End Month 5**: All new content drafted (345 pages)
- [ ] **End Month 6**: Camera-ready manuscripts for both volumes
---
## ⚠️ Risks and Mitigation
### Risk 1: Page Count Imbalance
**Risk**: V1 or V2 ends up significantly larger/smaller than target
**Mitigation**:
- Monitor page counts weekly
- Adjust compression/expansion as needed
- Maintain flexible targets (±100 pages)
### Risk 2: Missing Dependencies
**Risk**: V2 assumes V1 knowledge not actually covered
**Mitigation**:
- Create prerequisite matrix
- Add recap sections to V2 chapters
- Review cross-references monthly
### Risk 3: Timeline Slippage
**Risk**: New chapter writing takes longer than estimated
**Mitigation**:
- Prioritize essential chapters first
- Have backup plan to defer Priority 3 chapters
- Build 2-week buffer into timeline
### Risk 4: Content Duplication
**Risk**: Same concept explained in both volumes
**Mitigation**:
- Clear "basic vs. advanced" delineation
- V2 references V1 explicitly
- Review for overlap in Month 6
---
## 📚 Reference Materials
### Pedagogical Framework
- **V1 Narrative**: Foundations → Building → Optimizing → Impact
- **V2 Narrative**: Scale → Production → Responsibility
- **Connection**: V1 ends with inspiration, V2 begins with a bridge chapter
### Chapter Surgery Guidelines
- **Single-machine boundary**: Keep in V1
- **Distributed systems**: Move to V2
- **Production scale**: Move to V2
- **Advanced optimization**: Move to V2
### Writing Standards
- Timeless principles over current tech
- Every chapter has: Purpose, Learning Outcomes, Summary
- Concrete examples throughout
- "Fallacies and Pitfalls" section
---
## 🔧 Tools and Workflow
### Version Control
- [ ] Create `volume-split` branch in Git
- [ ] Track all changes in branch
- [ ] Regular commits with clear messages
### Organization
- `book/volume1/` - Volume 1 chapters
- `book/volume2/` - Volume 2 chapters
- `book/docs/` - All planning documents
- `book/extracted/` - Content extracted from V1
### Quality Checks
- [ ] Weekly page count tracking
- [ ] Monthly cross-reference review
- [ ] Technical accuracy spot checks
- [ ] Pedagogical flow reviews
---
## 📞 Stakeholder Communication
### MIT Press Updates
- **Monthly**: Progress report with page counts
- **Major milestones**: Notify when phases complete
- **Issues**: Immediate communication of risks
### Community/Reviewers
- **End Month 2**: Share V1 draft for review
- **End Month 4**: Share V2 draft chapters for review
- **End Month 5**: Full review cycle
---
## ✅ Final Checklist (Month 6)
### Volume 1 Completion
- [ ] All 14 chapters present and coherent
- [ ] Page count: 1,150-1,250 pages
- [ ] All cross-references to V2 marked clearly
- [ ] Exercises and quizzes updated
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared
### Volume 2 Completion
- [ ] All 15 chapters present and coherent
- [ ] Page count: 1,100-1,200 pages
- [ ] Bridge chapter effective
- [ ] New chapters integrate extracted content
- [ ] Exercises and quizzes complete
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared
### Both Volumes
- [ ] Consistent notation across volumes
- [ ] No content duplication
- [ ] Clear prerequisite chain
- [ ] Professional copyedit complete
- [ ] Ready for MIT Press submission
---
## 📊 Success Metrics
### Quantitative
- Volume 1: 1,150-1,250 pages ✓/✗
- Volume 2: 1,100-1,200 pages ✓/✗
- New content: 325-375 pages ✓/✗
- Timeline: 6 months ✓/✗
### Qualitative
- Each volume independently valuable ✓/✗
- Clear pedagogical progression ✓/✗
- MIT Press approval ✓/✗
- Reviewer feedback positive ✓/✗
---
## 🎓 Post-Completion
### Publication Process
- [ ] Submit to MIT Press
- [ ] Incorporate editorial feedback
- [ ] Final production review
- [ ] Marketing materials
- [ ] Course adoption outreach
### Maintenance
- [ ] Errata tracking system
- [ ] Annual review cycle
- [ ] Community feedback integration
- [ ] Future edition planning
---
## 📝 Notes and Decisions
### December 2024 - Project Launch
- Decision: Committed to 6-month timeline
- Decision: Will do full surgery, not quick split
- Decision: Flagship quality is priority over speed
- Next decision needed: [track decisions here]
---
**Last Updated**: December 7, 2024
**Status**: Planning Complete - Ready to Begin Execution
**Next Action**: Begin Chapter 6 (Data Engineering) surgery - Week 1
---
*This roadmap is the master coordination document for the two-volume split project. Update weekly with progress, decisions, and course corrections.*

File diff suppressed because it is too large.


@@ -1,216 +0,0 @@
# Machine Learning Systems: Two-Volume Structure
**Proposal for MIT Press**
*Draft: December 2024*
---
## Executive Summary
The *Machine Learning Systems* textbook will be published as two complementary volumes of 14 chapters each:
| Volume | Title | Focus | Chapters |
|--------|-------|-------|----------|
| **Volume 1** | Introduction to Machine Learning Systems | Complete ML lifecycle, single-system focus | 14 (all existing) |
| **Volume 2** | Advanced Machine Learning Systems | Principles of scale, distribution, and production | 14 (6 existing, 8 new) |
**Guiding Philosophy:**
- **Volume 1**: Everything you need to build ML systems on a single machine, ending on a positive note with societal impact
- **Volume 2**: Timeless principles for operating ML systems at scale, grounded in physics and mathematics rather than current technologies
---
## Volume 1: Introduction to Machine Learning Systems
*The complete ML lifecycle: understand it, build it, optimize it, deploy it, use it for good.*
| Part | Chapter | Description |
|------|---------|-------------|
| **Part I: Systems Foundations** | | *What are ML systems?* |
| | 1. Introduction | Motivation and scope |
| | 2. ML Systems | System-level view of machine learning |
| | 3. Deep Learning Primer | Neural network fundamentals |
| | 4. DNN Architectures | Modern architecture patterns |
| **Part II: Design Principles** | | *How do you build ML systems?* |
| | 5. Workflow | End-to-end ML pipeline design |
| | 6. Data Engineering | Data collection, processing, validation |
| | 7. Frameworks | PyTorch, TensorFlow, JAX ecosystem |
| | 8. Training | Training loops, hyperparameters, convergence |
| **Part III: Performance Engineering** | | *How do you make ML systems fast?* |
| | 9. Efficient AI | Efficiency principles and metrics |
| | 10. Optimizations | Quantization, pruning, distillation |
| | 11. Hardware Acceleration | GPUs, TPUs, custom accelerators |
| | 12. Benchmarking | Measurement, MLPerf, evaluation methodology |
| **Part IV: Practice & Impact** | | *How do you deploy and use ML systems responsibly?* |
| | 13. ML Operations | Deployment, monitoring, CI/CD for ML |
| | 14. AI for Good | Positive societal applications |
**Total: 14 chapters across 4 parts (all existing content)**
*Early awareness:* include a short Sustainable AI note in Benchmarking or ML Operations to flag energy and carbon impacts without adding another chapter.
### Volume 1 Narrative Arc
The book progresses from understanding → building → optimizing → deploying → impact:
1. **Foundations** establish what ML systems are and why they matter
2. **Design** teaches how to construct complete pipelines
3. **Performance** shows how to make systems efficient
4. **Practice & Impact** completes the lifecycle and ends on an inspirational note
Ending on "AI for Good" leaves students with a positive vision of what they can build.
---
## Volume 2: Advanced Machine Learning Systems
*Timeless principles for building and operating ML systems at scale.*
| Part | Chapter | Status | Description |
|------|---------|--------|-------------|
| **Part I: Data Movement & Memory** | | | *Moving data is the bottleneck* |
| | 1. Memory Hierarchies for ML | 🆕 NEW | GPU memory, HBM, activation checkpointing |
| | 2. Storage Systems for ML | 🆕 NEW | Distributed storage, checkpointing, feature stores |
| | 3. Communication & Collective Operations | 🆕 NEW | AllReduce, gradient compression, network topology |
| **Part II: Parallelism & Coordination** | | | *Decomposing computation across machines* |
| | 4. Distributed Training | 🆕 NEW | Data/model/pipeline/tensor parallelism |
| | 5. Fault Tolerance & Recovery | 🆕 NEW | Checkpointing, elastic training, failure handling |
| | 6. Inference Systems | 🆕 NEW | Batching, serving architectures, autoscaling |
| **Part III: Constrained Environments** | | | *Doing more with less* |
| | 7. On-device Learning | Existing | Training and adaptation on edge devices |
| | 8. Edge Deployment | 🆕 NEW | Compilation, runtime optimization, real-time |
| **Part IV: Adversarial Environments** | | | *Systems under attack and uncertainty* |
| | 9. Privacy in ML Systems | Existing | Differential privacy, federated learning, secure aggregation |
| | 10. Security in ML Systems | 🆕 NEW | Supply chain, API security, multi-tenant isolation |
| | 11. Robust AI | Existing | Adversarial robustness, distribution shift, monitoring |
| **Part V: Stewardship** | | | *Building systems that serve humanity* |
| | 12. Responsible AI | Existing | Fairness, accountability, transparency at scale |
| | 13. Sustainable AI | Existing | Energy efficiency, carbon footprint, environmental impact |
| | 14. Frontiers & Future Directions | Existing | Emerging paradigms, open problems, conclusion |
**Total: 14 chapters across 5 parts (6 existing, 8 new)**
---
## New Content for Volume 2
### Part I: Data Movement & Memory
*The physics of data movement is the fundamental constraint in modern ML.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Memory Hierarchies for ML** | GPU memory management, HBM architecture, caching strategies, activation checkpointing, memory-efficient attention | Memory bandwidth limits compute utilization |
| **Storage Systems for ML** | Distributed file systems, checkpoint I/O, feature stores, data lakes, prefetching, I/O scheduling | Storage throughput gates training speed |
| **Communication & Collective Operations** | AllReduce algorithms, ring/tree topologies, gradient compression, RDMA fundamentals, network topology design | Communication overhead limits scaling |
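The AllReduce row above can be made concrete with a toy simulation. The sketch below is illustrative only (plain Python, one number per chunk) and is not drawn from the chapter drafts; it shows the two phases of ring AllReduce and why each node transmits roughly 2(N-1)/N of the data volume, nearly independent of cluster size N.

```python
import copy

def ring_allreduce(node_data):
    """Simulate ring AllReduce over n nodes.

    node_data: list of n lists, each with n entries (one entry stands in
    for one chunk of the gradient). Returns what each node ends up holding:
    the elementwise sum across all nodes, replicated everywhere.
    """
    n = len(node_data)
    data = copy.deepcopy(node_data)
    # Phase 1: reduce-scatter. After n-1 steps, node i holds the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        snapshot = copy.deepcopy(data)  # all sends happen "simultaneously"
        for i in range(n):
            c = (i - step) % n          # chunk node i forwards this step
            data[(i + 1) % n][c] += snapshot[i][c]
    # Phase 2: all-gather. Each node forwards its fully reduced chunk
    # around the ring until every node has every chunk.
    for step in range(n - 1):
        snapshot = copy.deepcopy(data)
        for i in range(n):
            c = (i + 1 - step) % n      # fully reduced chunk to forward
            data[(i + 1) % n][c] = snapshot[i][c]
    return data
```

Each node sends 2(n-1) chunks in total, each 1/n of the data, which is the bandwidth-optimality argument the chapter would develop.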
### Part II: Parallelism & Coordination
*The mathematics of decomposing work across machines.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Distributed Training** | Data parallelism, model parallelism (tensor, pipeline, expert), hybrid strategies, synchronization, load balancing | Parallelism has fundamental trade-offs |
| **Fault Tolerance & Recovery** | Checkpoint strategies, async checkpointing, elastic training, failure detection, graceful degradation | Large systems fail; recovery must be designed in |
| **Inference Systems** | Batching strategies, continuous batching, KV cache management, model serving patterns, autoscaling, SLO management | Serving has different constraints than training |
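As a minimal illustration of the data-parallelism row above (a hedged sketch, not text from the chapter): each replica computes a gradient on its own data shard, the gradients are averaged (the job AllReduce performs in practice), and every replica applies the identical update, keeping the weights synchronized.

```python
def data_parallel_step(weights, shard_grads, lr=0.1):
    """One synchronous data-parallel SGD step.

    weights: current parameter vector (shared by all replicas).
    shard_grads: one gradient vector per worker, computed on its shard.
    Returns the updated weights every replica ends up with.
    """
    n = len(shard_grads)
    # Average the per-shard gradients (what AllReduce computes at scale).
    avg = [sum(g[j] for g in shard_grads) / n for j in range(len(weights))]
    # Identical update on every replica keeps weights in sync.
    return [w - lr * gj for w, gj in zip(weights, avg)]
```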
### Part III: Constrained Environments
*Operating under resource limitations.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Edge Deployment** | Model compilation, runtime optimization, heterogeneous hardware, real-time constraints, power management | Constraints force creativity |
### Part IV: Adversarial Environments
*Systems facing attacks, privacy requirements, and uncertainty.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Security in ML Systems** | Model provenance, supply chain security, API protection, multi-tenant isolation, access control | Production systems face adversaries |
---
## Design Principles
### Why This Structure Works
**Volume 1 (Single System)**
- Teaches the complete lifecycle
- Everything can be learned and practiced on one machine
- Ends positively with societal impact
**Volume 2 (Distributed Systems)**
- Builds on Volume 1 foundations
- Addresses what changes at scale
- Organized around timeless constraints, not current technologies
### What Makes Volume 2 Timeless
Each part addresses constraints rooted in physics, mathematics, or human nature:
| Part | Eternal Constraint | Foundation |
|------|-------------------|------------|
| Data Movement & Memory | Moving data costs more than compute | Physics: speed of light, memory bandwidth |
| Parallelism & Coordination | Work must be decomposed and synchronized | Mathematics of parallel computation |
| Constrained Environments | Resources are always finite | Economics and physics |
| Adversarial Environments | Attackers and uncertainty exist | Human nature, statistics |
| Stewardship | Technology must serve humanity | Ethics, sustainability |
Chapters use current examples (LLMs, transformers, specific hardware) but frame them as instances of these enduring principles.
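The "moving data costs more than compute" constraint can be quantified with the roofline model. The sketch below uses hypothetical hardware numbers (100 TFLOP/s peak compute, 2 TB/s memory bandwidth), chosen for illustration rather than taken from any vendor specification.

```python
def attainable_tflops(flops, bytes_moved, peak_tflops=100.0, mem_bw_tbps=2.0):
    """Roofline model: throughput is capped either by peak compute or by
    arithmetic intensity (FLOPs per byte moved) times memory bandwidth."""
    intensity = flops / bytes_moved        # FLOPs per byte
    return min(peak_tflops, intensity * mem_bw_tbps)

# Elementwise add of two fp32 values: 1 FLOP per 12 bytes moved
# (read 4 + 4, write 4), so it is severely memory-bound.
elementwise = attainable_tflops(flops=1, bytes_moved=12)

# A large matrix multiply reuses each byte many times (high intensity),
# so it can reach the compute roof instead.
gemm = attainable_tflops(flops=2_000, bytes_moved=1)
```

On these assumed numbers, the elementwise kernel attains about 0.17 TFLOP/s against a 100 TFLOP/s peak, which is the bandwidth-wall argument the part is organized around.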
---
## Content Migration Summary
| Chapter | Volume 1 | Volume 2 | Rationale |
|---------|----------|----------|-----------|
| Introduction through Benchmarking | ✓ | | Core technical content |
| ML Operations | ✓ | | Completes the lifecycle |
| AI for Good | ✓ | | Positive conclusion |
| On-device Learning | | ✓ | Edge/constrained is advanced |
| Privacy & Security | | ✓ | Production security is advanced |
| Robust AI | | ✓ | Production robustness is advanced |
| Responsible AI | | ✓ | Scale changes the challenges |
| Sustainable AI | | ✓ | Datacenter scale is advanced |
| Frontiers | | ✓ | Conclusion for advanced volume |
---
## Audience
| Volume | Primary Audience | Use Cases |
|--------|-----------------|-----------|
| Volume 1 | All ML practitioners, undergraduates, bootcamp students | First course in ML systems, self-study |
| Volume 2 | Infrastructure engineers, graduate students, researchers | Advanced course, reference for practitioners at scale |
---
## Collaboration Model
Volume 2's new chapters are candidates for collaborative authorship:
| Topic Area | Ideal Collaborator Profile |
|------------|---------------------------|
| Memory & Storage | Datacenter architects, MLPerf Storage contributors |
| Networking & Communication | Distributed systems researchers, framework developers |
| Distributed Training | PyTorch/JAX distributed teams, hyperscaler engineers |
| Fault Tolerance | Site reliability engineers, systems researchers |
| Inference Systems | ML serving infrastructure engineers |
| Edge Deployment | Embedded ML practitioners, compiler engineers |
| Security | ML security researchers, production security engineers |
---
## Summary Statistics
| Metric | Volume 1 | Volume 2 |
|--------|----------|----------|
| Chapters | 14 | 14 |
| Parts | 4 | 5 |
| Existing content | 14 | 6 |
| New content | 0 | 8 |
| Focus | Single system | Distributed systems |
| Prerequisite | None | Volume 1 |
---
*Document Version: December 2024*
*For discussion with MIT Press and potential collaborators*


@@ -1,175 +0,0 @@
# Round 1 Reviewer Feedback Synthesis
## Reviewers
- **David Patterson** - Computer architecture, textbook author
- **Ion Stoica** - Distributed systems (Ray, Spark), Berkeley
- **Vijay Reddi** - TinyML, MLPerf, Harvard
- **Jeff Dean** - Google Senior Fellow, large-scale systems
---
## Consensus Issues (All/Most Reviewers Agree)
### 1. Volume I is Incomplete Without Some Production/Scale Awareness
**Patterson**: "Cannot teach deployment without responsibility integration"
**Stoica**: "Data parallelism is now foundational, not advanced"
**Reddi**: "Edge deployment is fundamental, not advanced"
**Dean**: "Scale thinking should be woven throughout Vol I, not deferred"
**Consensus**: Vol I currently produces graduates who lack awareness of:
- Distributed training basics (Stoica, Dean)
- Resource constraints/edge deployment (Reddi)
- Responsible practices (Patterson)
- Cost and production realities (Dean)
### 2. Chapter 14 "Preview" Approach is Problematic
**Patterson**: "Pedagogically misguided - responsibility should be integrated throughout"
**All others**: Generally agree preview is insufficient
**Consensus**: The "preview" approach treats important topics as afterthoughts.
### 3. Volume II Part I/II Ordering Needs Work
**Stoica**: "Teaching infrastructure before algorithms is pedagogically backwards"
**Dean**: Agrees infra should come with context
**Consensus**: Teach distributed algorithms first, then infrastructure that supports them.
### 4. Missing Hands-On/Practical Content
**Patterson**: "No mention of labs or programming assignments"
**Reddi**: "Students cannot deploy to microcontroller after Vol I"
**Dean**: "Missing debugging and profiling skills"
**Consensus**: Both volumes need explicit practical components.
---
## Key Disagreements/Different Emphases
### What Should Move to Volume I?
| Topic | Patterson | Stoica | Reddi | Dean |
|-------|-----------|--------|-------|------|
| Edge/TinyML deployment | Maybe | - | **CRITICAL** | - |
| Data parallelism basics | - | **CRITICAL** | - | Important |
| Checkpointing | - | **CRITICAL** | - | Important |
| Responsible AI integration | **CRITICAL** | - | - | - |
| Cost awareness | - | - | - | **CRITICAL** |
**Tension**: Each reviewer wants their specialty area elevated in Vol I. Cannot add everything without making Vol I too large.
### Chapter Count
- **Patterson**: 14 chapters OK if balanced, concerned about page counts
- **Reddi**: Suggests 15 chapters (add edge deployment)
- **Stoica/Dean**: Focus less on count, more on content depth
---
## Specific Recommendations by Volume
### Volume I Additions (Ranked by Consensus)
| Priority | Addition | Supporters |
|----------|----------|------------|
| HIGH | Data parallelism basics in Ch 8 (Training) | Stoica, Dean |
| HIGH | Checkpointing basics in Ch 8 | Stoica, Dean |
| HIGH | Resource-constrained deployment chapter | Reddi (strong), Patterson (partial) |
| HIGH | Cost/efficiency awareness throughout | Dean |
| MEDIUM | Integrate responsibility throughout (not just Ch 14) | Patterson |
| MEDIUM | Expand quantization/pruning depth (Ch 10) | Reddi |
| MEDIUM | Strengthen benchmarking rigor (Ch 12) | Reddi |
### Volume II Restructuring
| Priority | Change | Supporters |
|----------|--------|------------|
| HIGH | Reorder Parts I/II (algorithms before infrastructure) | Stoica, Dean |
| HIGH | Add distributed systems theory basics | Stoica |
| MEDIUM | Add production debugging chapter | Dean, Stoica |
| MEDIUM | Expand MLOps chapter significantly | Dean |
| MEDIUM | Add cost/resource management | Dean |
---
## The Central Dilemma
**Cannot add everything without making volumes too large.**
Options:
### Option A: Minimal Vol I Changes (Original + Small Additions)
- Keep current 14-chapter structure
- Add data parallelism section to Ch 8
- Add checkpointing section to Ch 8
- Strengthen Ch 14 preview (but keep as preview)
- Vol II restructures Part I/II ordering
**Pro**: Minimal disruption, faster to implement
**Con**: Patterson and Reddi concerns not fully addressed
### Option B: Add Edge Deployment to Vol I (Reddi's Recommendation)
- Add new Chapter 12: "Resource-Constrained Deployment"
- Renumber remaining chapters (15 total)
- Expand quantization/pruning depth
- Vol II restructures Part I/II
**Pro**: Addresses critical industry need (mobile/embedded)
**Con**: Makes Vol I larger, may be too ambitious
### Option C: Integrate Responsibility Throughout Vol I (Patterson's Recommendation)
- Distribute responsible systems content across chapters
- Remove standalone Ch 14 preview
- Add fairness to Ch 6 (Data), security to Ch 10, sustainability to Ch 9
- Keep 14 chapters but redistribute content
**Pro**: Pedagogically sounder integration
**Con**: Significant rewrite of multiple chapters
### Option D: Hybrid - Core Additions Only
- Add data parallelism + checkpointing to Ch 8 (Stoica/Dean consensus)
- Add brief edge deployment section to Ch 13 (MLOps) not new chapter
- Keep Ch 14 but strengthen it with integrated callouts in earlier chapters
- Vol II restructures Part I/II
**Pro**: Addresses highest-consensus items without major restructure
**Con**: Doesn't fully satisfy any single reviewer
---
## Recommendation for User Decision
**Suggested path forward**: Option D (Hybrid) for Vol I structure, with Vol II restructuring.
**Rationale**:
1. Stoica and Dean (industry leaders in distributed systems) agree on data parallelism/checkpointing - this is the highest consensus item
2. Full edge deployment chapter (Reddi) is valuable but may be too ambitious for immediate restructure
3. Full responsibility integration (Patterson) is pedagogically ideal but requires significant rewrite
4. Vol II restructuring (algorithms before infrastructure) has clear consensus
**What this means for your current draft**:
**Volume I Changes**:
- Ch 8 (Training): Add "Distributed Training Fundamentals" section
- Ch 8 (Training): Add "Checkpointing" section
- Ch 9 (Efficiency): Add brief energy/sustainability measurement
- Ch 14 (Preview): Strengthen, add forward references throughout earlier chapters
**Volume II Changes**:
- Reorder: Distributed Training → Communication → Fault Tolerance → THEN Infrastructure
- Add brief theory section to Ch 1
---
## Questions for User
1. **Edge deployment priority**: Is adding a full edge deployment chapter to Vol I worth the extra scope? (Reddi makes a strong case for industry relevance)
2. **Responsibility integration**: Should we integrate responsible AI throughout Vol I chapters (Patterson's strong recommendation), or keep the preview approach?
3. **Page count targets**: Do you have MIT Press guidance on target page counts? This affects how much we can add.
4. **Volume II priority**: Is restructuring Vol II Part I/II ordering acceptable, or is that structure already locked?


@@ -1,207 +0,0 @@
# Machine Learning Systems: Two-Volume Structure
**Status**: Approved (Round 2 Review Complete)
**Target Publisher**: MIT Press
**Audience**: Undergraduate and graduate CS/ECE students, academic courses
---
## Overview
This textbook is being split into two volumes to serve different learning objectives:
- **Volume I: Introduction to ML Systems** - Foundational knowledge for building, optimizing, and deploying ML systems
- **Volume II: Advanced ML Systems** - Scale, distributed systems, production hardening, and responsible deployment
Each volume should stand alone as a complete learning experience while together forming a comprehensive treatment of the field.
---
## Volume I: Introduction to ML Systems
### Goal
A reader completes Volume I and can competently build, optimize, and deploy ML systems with awareness of responsible practices.
### Target Audience
- Upper-level undergraduates
- Early graduate students
- Practitioners transitioning into ML systems
### Course Mapping
- Single semester "Introduction to ML Systems" course
- Foundation for more advanced distributed systems or MLOps courses
### Structure (14 chapters)
#### Part I: Foundations
Establish the conceptual framework for understanding ML as a systems discipline.
| Ch | Title | Purpose |
|----|-------|---------|
| 1 | Introduction to Machine Learning Systems | Why ML systems thinking matters |
| 2 | The ML Systems Landscape | Survey of the field, key components |
| 3 | Deep Learning Foundations | Mathematical and conceptual foundations |
| 4 | Modern Neural Architectures | CNNs, RNNs, Transformers, architectural choices |
#### Part II: Development
Practical skills for constructing ML systems from data to trained model.
| Ch | Title | Purpose |
|----|-------|---------|
| 5 | ML Development Workflow | End-to-end process, experimentation |
| 6 | Data Engineering for ML | Pipelines, preprocessing, data quality |
| 7 | ML Frameworks and Tools | PyTorch, TensorFlow, ecosystem |
| 8 | Training Systems | Training loops, distributed basics, debugging |
#### Part III: Optimization
Techniques for making ML systems efficient and fast.
| Ch | Title | Purpose |
|----|-------|---------|
| 9 | Efficiency in AI Systems | Why efficiency matters, metrics |
| 10 | Model Optimization Techniques | Quantization, pruning, distillation |
| 11 | Hardware Acceleration | GPUs, TPUs, custom accelerators |
| 12 | Benchmarking and Evaluation | Measuring performance, MLPerf |
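As a taste of the material Ch 10 covers, a toy version of post-training quantization — mapping float weights to 8-bit integers and back. This is a generic affine-quantization sketch for illustration, not an excerpt from the draft:

```python
def quantize(weights, num_bits=8):
    """Affine (asymmetric) quantization: floats -> ints in [0, 2^b - 1] plus (scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # avoid divide-by-zero for constant weights
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats; the gap to the originals is the quantization error."""
    return [(qi - zero_point) * scale for qi in q]

w = [-1.2, 0.0, 0.4, 2.3]
q, scale, zp = quantize(w)
w_hat = dequantize(q, scale, zp)
```

The per-weight error is bounded by roughly one quantization step (`scale`), which is the trade-off the chapter would quantify against the 4x memory savings of int8 over float32.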
#### Part IV: Operations
Getting models into production responsibly.
| Ch | Title | Purpose |
|----|-------|---------|
| 13 | ML Operations Fundamentals | Deployment, monitoring, CI/CD for ML |
| 14 | Responsible Systems Preview | Brief intro to robustness, security, fairness, sustainability (preview of Vol II topics) |
#### Volume I Conclusion
Synthesis, what was learned, bridge to Volume II.
---
## Volume II: Advanced ML Systems
### Goal
A reader completes Volume II understanding how to build and operate ML systems at scale, with production resilience and responsible practices.
### Target Audience
- Graduate students
- Industry practitioners
- Researchers building large-scale systems
### Prerequisites
- Volume I or equivalent knowledge
- Basic distributed systems concepts helpful
### Course Mapping
- Graduate seminar on large-scale ML systems
- Advanced MLOps course
- Research group reading material
### Structure (16 chapters)
#### Part I: Foundations of Scale
Infrastructure and concepts for scaling beyond single machines.
| Ch | Title | Purpose |
|----|-------|---------|
| 1 | From Single Systems to Planetary Scale | Motivation, challenges of scale |
| 2 | Infrastructure for Large-Scale ML | Clusters, cloud, resource management |
| 3 | Storage Systems for ML | Data lakes, distributed storage, checkpointing |
| 4 | Communication and Collective Operations | AllReduce, parameter servers, network topology |
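Ch 4's central primitive can be stated in a few lines. A single-process simulation of AllReduce over per-worker gradients — illustrative semantics only, not a real distributed implementation:

```python
def all_reduce_sum(worker_grads):
    """AllReduce with a sum op: every worker ends up holding the elementwise total.

    Real implementations (NCCL, MPI) use ring or tree schedules so per-link
    traffic stays near 2x the gradient size regardless of worker count; this
    single-process version only simulates the semantics.
    """
    total = [sum(vals) for vals in zip(*worker_grads)]
    return [list(total) for _ in worker_grads]   # each worker receives a copy

# Three workers, each with a local gradient over the same two parameters
grads = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
reduced = all_reduce_sum(grads)
# Every worker now holds roughly [0.9, 1.2]; dividing by the worker count
# gives the averaged gradient used in synchronous data parallelism.
```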
#### Part II: Distributed Systems
Training and inference across multiple machines.
| Ch | Title | Purpose |
|----|-------|---------|
| 5 | Distributed Training Systems | Data parallel, model parallel, pipeline parallel |
| 6 | Fault Tolerance and Resilience | Checkpointing, recovery, handling failures |
| 7 | Inference at Scale | Serving systems, batching, latency optimization |
| 8 | Edge Intelligence Systems | Deploying ML at the edge, constraints |
#### Part III: Production Challenges
Real-world complexities of operating ML systems.
| Ch | Title | Purpose |
|----|-------|---------|
| 9 | On-Device Learning | Training on edge devices, federated learning |
| 10 | Privacy-Preserving ML Systems | Differential privacy, secure computation |
| 11 | Robust and Reliable AI | Adversarial robustness, distribution shift |
| 12 | ML Operations at Scale | Advanced MLOps, platform engineering |
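For Ch 10 (Privacy-Preserving ML Systems), the textbook starting point is the Laplace mechanism for differential privacy. A minimal scalar-query sketch, with illustrative names and parameter values chosen for the example:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) via the inverse CDF of a uniform sample."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_release(true_value, sensitivity, epsilon, rng):
    """Laplace mechanism: an epsilon-DP release of one numeric query result."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
# Release a count of 42 where any one person changes the count by at most 1
noisy = private_release(42, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller epsilon means stronger privacy and more noise — the accuracy/privacy trade-off that chapter would then extend to gradients and secure aggregation.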
#### Part IV: Responsible Deployment
Building ML systems that benefit society.
| Ch | Title | Purpose |
|----|-------|---------|
| 13 | Responsible AI Systems | Fairness, accountability, transparency |
| 14 | Sustainable AI | Environmental impact, efficient computing |
| 15 | AI for Good | Applications for societal benefit |
| 16 | Frontiers and Future Directions | Emerging trends, open problems |
#### Volume II Conclusion
Synthesis, future of the field, call to action.
---
## Key Design Decisions
### Why This Split?
1. **Pedagogical Progression**: Vol I covers what every ML practitioner needs. Vol II covers what scale/production engineers need.
2. **Course Adoptability**: Vol I maps to a single semester intro course. Vol II maps to an advanced graduate seminar.
3. **Standalone Completeness**: A reader of only Vol I still gets responsible systems awareness through Chapter 14.
4. **Industry Alignment**: Vol I produces capable junior engineers. Vol II produces senior/staff-level systems thinkers.
### Chapter 14 in Volume I: Responsible Systems Preview
This chapter is intentionally brief (a preview, not a deep dive), covering:
- Robustness basics (models can fail)
- Security basics (models can be attacked)
- Fairness basics (models can discriminate)
- Sustainability basics (training has environmental cost)
Each topic points to the relevant Volume II chapter for deep treatment. This ensures Vol I readers are aware of these concerns without duplicating Vol II content.
### What Moves Between Volumes?
From original single-book structure:
- On-Device Learning → Vol II (requires scale context)
- Privacy/Security → Vol II (production concern)
- Robust AI → Vol II (advanced topic)
- Responsible AI → Vol II (deep treatment)
- Sustainable AI → Vol II (deep treatment)
- AI for Good → Vol II (capstone application)
- Frontiers → Vol II (forward-looking capstone)
---
## Questions for Reviewers
1. Does Volume I stand alone as a complete, responsible introduction to ML systems?
2. Is the progression within each volume logical for students?
3. Would you adopt Volume I for an introductory ML systems course?
4. Is Chapter 14 (Responsible Systems Preview) sufficient, or should Vol I include more depth on any topic?
5. Are any chapters misplaced between volumes?
6. Are 14 chapters (Vol I) and 16 chapters (Vol II) the right sizes?
7. What's missing from either volume?
---
## Revision History
- **v0.1** (2024-12-31): Initial draft for review
- **v0.2** (2024-12-31): Updated Part names based on reviewer feedback
- Volume I: Single-word Part names (Foundations, Development, Optimization, Operations)
- Volume II: Two-word Part names (unchanged, already clear)
- **v1.0** (2024-12-31): Structure approved after Round 2 review
- All reviewers (Patterson, Stoica, Reddi, Dean) approve chapter structure
- Part naming convention approved
- Ready for website implementation