chore: remove volume-splitting strategy docs from public repo

These files contain internal planning documents that should not be public:
- VOLUME_SPLIT_ROADMAP.md (project timeline and MIT Press plans)
- VOLUME_SPLIT_SURGICAL_PLAN.md (section-by-section surgery)
- VOLUME_STRUCTURE_PROPOSAL.md (publisher proposal draft)
2026-03-11 17:49:25 -05:00 · 2026-01-04 18:07:27 -05:00
parent 0f1b9d1c77
commit cdcb4e9529
3 changed files with 0 additions and 2043 deletions
--- a/book/docs/VOLUME_SPLIT_ROADMAP.md
+++ b/book/docs/VOLUME_SPLIT_ROADMAP.md
@@ -1,394 +0,0 @@
-# Machine Learning Systems: Two-Volume Split - Master Roadmap
-
-**Project Start Date**: December 2024
-**Target Completion**: June 2025 (6 months)
-**Goal**: Create flagship two-volume ML Systems textbook series for MIT Press
-
---
-
-## 📋 Project Documents
-
-### Core Planning Documents
-1. **`MIT_PRESS_PROPOSAL.md`** - Official proposal for MIT Press (ready to submit)
-2. **`VOLUME_SPLIT_SURGICAL_PLAN.md`** - Section-by-section surgery instructions (50+ pages)
-3. **`VOLUME_SPLIT_ROADMAP.md`** - This document: master project plan and progress tracker
-4. **`VOLUME_STRUCTURE_PROPOSAL.md`** - Original analysis and proposal
-
-### Supporting Documents
- `VOLUME_SPLIT_ANALYSIS.md` - Deep analysis of pedagogical issues
- `VOLUME_SPLIT_EXECUTIVE_SUMMARY.md` - Quick reference summary
- `DISTRIBUTED_CONTENT_ADDITIONS.md` - Distributed awareness additions for V1
-
---
-
-## 🎯 Project Vision
-
-**Volume I: Introduction to ML Systems** (~1,150-1,200 pages)
- Complete single-system ML engineering
- Includes distributed awareness (not implementation)
- Target: Undergrads, bootcamps, ML engineers entering the field
-
-**Volume II: Advanced ML Systems** (~1,100-1,150 pages)
- Distributed systems, production scale, responsibility
- Built on timeless principles
- Target: Graduate students, ML infrastructure engineers, senior practitioners
-
-**The One-Liner**:
-> "Volume I teaches you to build ML systems that work; Volume II teaches you to build ML systems that scale."
-
---
-
-## 📊 Current State (December 2024)
-
-### Existing Content
- **Total pages**: 2,172 pages across 22 chapters
- **Status**: Complete draft of comprehensive single-volume book
- **Quality**: Refined through extensive review feedback
-
-### Content Distribution
- **Chapters 1-14**: Form basis of Volume 1 (with surgery)
- **Chapters 15-21**: Move to Volume 2
- **New content needed**: 325-375 pages (8 new chapters for V2)
-
---
-
-## 🗓️ Six-Month Timeline
-
-### Phase 1: Chapter Surgery (Months 1-2)
-**Goal**: Extract distributed content from 7 chapters, create clean V1
-
-#### Month 1: Content Extraction
- **Week 1-2**: Extract distributed content from Chapters 6, 7, 8
-  - [ ] Chapter 6 (Data Engineering): Extract 40 pages of distributed content
-  - [ ] Chapter 7 (Frameworks): Extract 50 pages of distributed execution
-  - [ ] Chapter 8 (Training): Extract 60 pages of distributed training
-
- **Week 3-4**: Extract from Chapters 10, 11, 12, 13
-  - [ ] Chapter 10 (Optimizations): Extract 80 pages (NAS, AutoML)
-  - [ ] Chapter 11 (Hardware): Extract 50 pages (multi-chip)
-  - [ ] Chapter 12 (Benchmarking): Extract 40 pages (distributed benchmarking)
-  - [ ] Chapter 13 (MLOps): Extract 30 pages (production scale)
-
-#### Month 2: Bridging and Polish
- **Week 5-6**: Create V1 transitions
-  - [ ] Add "See Volume 2" callout boxes in V1
-  - [ ] Write brief distributed awareness sections
-  - [ ] Ensure V1 chapters remain coherent
-
- **Week 7-8**: Organize extracted content
-  - [ ] Create V2 chapter structure
-  - [ ] Place extracted content in appropriate V2 chapters
-  - [ ] Identify gaps in extracted content
-
-### Phase 2: New Content Development (Months 3-5)
-
-#### Month 3: Priority 1 Chapters (Essential Infrastructure)
- **Week 9-10**: Memory & Storage
  - [ ] V2 Ch2: Memory Hierarchies for ML (45 pages)
  - [ ] V2 Ch3: Storage Systems for ML (40 pages)
-
- **Week 11-12**: Communication & Distributed Training
  - [ ] V2 Ch4: Communication & Collective Operations (45 pages)
  - [ ] V2 Ch5: Distributed Training Systems (50 pages) - integrate extracted content
-
-#### Month 4: Priority 2 Chapters (Production Requirements)
- **Week 13-14**: Fault Tolerance & Inference
  - [ ] V2 Ch6: Fault Tolerance & Resilience (40 pages)
  - [ ] V2 Ch7: Inference at Scale (45 pages)
-
- **Week 15-16**: Integration
-  - [ ] Integrate all extracted content into new chapters
-  - [ ] Write chapter introductions and conclusions
-  - [ ] Create cross-references
-
-#### Month 5: Priority 3 Chapters (Specialized Topics)
- **Week 17-18**: Edge Systems
-  - [ ] V2 Ch8: Edge Intelligence Systems (50 pages)
-  - [ ] Integrate extracted edge content from Ch2
-
- **Week 19-20**: Final new chapters
-  - [ ] V2 Ch1: Bridge Chapter (30 pages) - From Single to Distributed
-  - [ ] Update existing V2 chapters (15-20) with introductions
-
-### Phase 3: Integration and Polish (Month 6)
-
-#### Month 6: Final Integration
- **Week 21-22**: Cross-References and Consistency
-  - [ ] Update all V1→V2 cross-references
-  - [ ] Update all V2→V1 prerequisite references
-  - [ ] Ensure consistent notation across volumes
-  - [ ] Verify all figure references work
-
- **Week 23**: Narrative Flow
-  - [ ] Review V1 narrative arc (Foundations → Building → Optimizing → Impact)
-  - [ ] Review V2 narrative arc (Scale → Production → Responsibility)
-  - [ ] Polish chapter transitions
-  - [ ] Write volume prefaces
-
- **Week 24**: Final Quality Checks
-  - [ ] Technical accuracy review
-  - [ ] Page count verification
-  - [ ] Exercise and quiz consistency
-  - [ ] Final copyedit pass
-  - [ ] Prepare camera-ready manuscripts
-
---
-
-## 📈 Progress Tracking
-
-### Volume 1 Progress (Target: 1,150-1,200 pages)
-
-| Chapter | Current | Target | Surgery Status | Notes |
-|---------|---------|--------|----------------|-------|
-| 1. Introduction | 90 | 60 | ⬜ Not Started | Compress history section |
-| 2. ML Systems | 70 | 70 | ⬜ Not Started | Extract hybrid architectures |
-| 3. DL Primer | 110 | 100 | ⬜ Not Started | No surgery (keep as-is) |
-| 4. DNN Architectures | 82 | 100 | ⬜ Not Started | No surgery (keep as-is) |
-| 5. Workflow | 51 | 40 | ⬜ Not Started | Minor compression |
-| 6. Data Engineering | 138 | 80 | ⬜ Not Started | Extract distributed storage |
-| 7. Frameworks | 121 | 100 | ⬜ Not Started | Extract distributed execution |
-| 8. Training | 157 | 100 | ⬜ Not Started | Extract distributed training |
-| 9. Efficient AI | 52 | 60 | ⬜ Not Started | No surgery (keep as-is) |
-| 10. Optimizations | 160 | 120 | ⬜ Not Started | Extract NAS, AutoML |
-| 11. Hardware | 181 | 90 | ⬜ Not Started | Extract multi-chip |
-| 12. Benchmarking | 124 | 80 | ⬜ Not Started | Extract distributed benchmarking |
-| 13. MLOps | 126 | 50 | ⬜ Not Started | Extract production scale |
-| 14. AI for Good | 84 | 50 | ⬜ Not Started | Minor compression |
-| **TOTAL** | **1,546** | **1,100** | **0%** | **V1 baseline complete** |
-
-Progress Key: ⬜ Not Started | 🟨 In Progress | ✅ Complete
-
-### Volume 2 Progress (Target: 1,100-1,150 pages)
-
-| Chapter | Source | Target | Status | Notes |
-|---------|--------|--------|--------|-------|
-| 1. Bridge: Single to Distributed | NEW | 30 | ⬜ Not Started | Write from scratch |
-| 2. Memory Hierarchies | NEW + Ch11 | 45 | ⬜ Not Started | New content + extracts |
-| 3. Storage Systems | NEW + Ch6 | 40 | ⬜ Not Started | New content + extracts |
-| 4. Communication & Collectives | NEW + Ch8 | 45 | ⬜ Not Started | New content + extracts |
-| 5. Distributed Training | Ch8 + Ch10 | 50 | ⬜ Not Started | Consolidate extracts |
-| 6. Fault Tolerance | NEW + Ch13 | 40 | ⬜ Not Started | New content + extracts |
-| 7. Inference at Scale | NEW + Ch2 + Ch13 | 45 | ⬜ Not Started | New content + extracts |
-| 8. Edge Intelligence | NEW + Ch2 | 50 | ⬜ Not Started | New content + extracts |
-| 9. On-Device Learning | Ch14 (existing) | 127 | ⬜ Not Started | Move from V1 |
-| 10. Privacy Systems | Ch15 (split) | 65 | ⬜ Not Started | Split Privacy/Security |
-| 11. Security Systems | Ch15 (split) | 68 | ⬜ Not Started | Split Privacy/Security |
-| 12. Robust AI | Ch16 (existing) | 137 | ⬜ Not Started | Move from V1 |
-| 13. Responsible AI | Ch17 (existing) | 135 | ⬜ Not Started | Move from V1 |
-| 14. Sustainable AI | Ch18 (existing) | 46 | ⬜ Not Started | Move from V1 |
-| 15. Frontiers & AGI | Ch19+20 (merge) | 78 | ⬜ Not Started | Merge two chapters |
-| **TOTAL** | **Mixed** | **1,001** | **0%** | **Need ~100 more pages** |
-
-### New Content Writing Progress (325-375 pages needed)
-
-| Chapter | Pages | Draft | Review | Final | Notes |
-|---------|-------|-------|--------|-------|-------|
-| Bridge Chapter | 30 | ⬜ | ⬜ | ⬜ | Priority 1 |
-| Memory Hierarchies | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
-| Storage Systems | 40 | ⬜ | ⬜ | ⬜ | Priority 1 |
-| Communication | 45 | ⬜ | ⬜ | ⬜ | Priority 2 |
-| Distributed Training | 50 | ⬜ | ⬜ | ⬜ | Priority 1 (+ extracts) |
-| Fault Tolerance | 40 | ⬜ | ⬜ | ⬜ | Priority 2 |
-| Inference at Scale | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
-| Edge Intelligence | 50 | ⬜ | ⬜ | ⬜ | Priority 3 |
-| **TOTAL NEW** | **345** | **0%** | **0%** | **0%** | 8 new chapters |
-
---
-
-## 📝 Weekly Progress Log
-
-### Week 1 (Dec 2-8, 2024)
- [x] Created comprehensive surgical plan
- [x] Created MIT Press proposal
- [x] Created master roadmap
- [ ] **Next**: Begin Chapter 6 surgery
-
-### Week 2 (Dec 9-15, 2024)
- [ ] Progress notes...
-
-### Week 3 (Dec 16-22, 2024)
- [ ] Progress notes...
-
-*(Continue weekly log throughout 6-month period)*
-
---
-
-## 🎯 Key Milestones
-
- [ ] **End Month 1**: All distributed content extracted from V1 chapters
- [ ] **End Month 2**: V1 chapters coherent and complete
- [ ] **End Month 3**: Priority 1 new chapters drafted (160 pages)
- [ ] **End Month 4**: Priority 2 new chapters drafted (85 pages)
- [ ] **End Month 5**: All new content drafted (345 pages)
- [ ] **End Month 6**: Camera-ready manuscripts for both volumes
-
---
-
-## ⚠️ Risks and Mitigation
-
-### Risk 1: Page Count Imbalance
-**Risk**: V1 or V2 ends up significantly larger/smaller than target
-**Mitigation**:
- Monitor page counts weekly
- Adjust compression/expansion as needed
- Allow a tolerance of ±100 pages around each target
-
-### Risk 2: Missing Dependencies
-**Risk**: V2 assumes V1 knowledge not actually covered
-**Mitigation**:
- Create prerequisite matrix
- Add recap sections to V2 chapters
- Review cross-references monthly
-
-### Risk 3: Timeline Slippage
-**Risk**: New chapter writing takes longer than estimated
-**Mitigation**:
- Prioritize essential chapters first
- Have backup plan to defer Priority 3 chapters
- Build 2-week buffer into timeline
-
-### Risk 4: Content Duplication
-**Risk**: Same concept explained in both volumes
-**Mitigation**:
- Clear "basic vs. advanced" delineation
- V2 references V1 explicitly
- Review for overlap in Month 6
-
---
-
-## 📚 Reference Materials
-
-### Pedagogical Framework
- **V1 Narrative**: Foundations → Building → Optimizing → Impact
- **V2 Narrative**: Scale → Production → Responsibility
- **Connection**: V1 ends with inspiration, V2 begins with bridge
-
-### Chapter Surgery Guidelines
- **Single-machine boundary**: Keep in V1
- **Distributed systems**: Move to V2
- **Production scale**: Move to V2
- **Advanced optimization**: Move to V2
-
-### Writing Standards
- Timeless principles over current tech
- Every chapter has: Purpose, Learning Outcomes, Summary
- Concrete examples throughout
- "Fallacies and Pitfalls" section
-
---
-
-## 🔧 Tools and Workflow
-
-### Version Control
- [ ] Create `volume-split` branch in Git
- [ ] Track all changes in branch
- [ ] Regular commits with clear messages
-
-### Organization
- `book/volume1/` - Volume 1 chapters
- `book/volume2/` - Volume 2 chapters
- `book/docs/` - All planning documents
- `book/extracted/` - Content extracted from V1
-
-### Quality Checks
- [ ] Weekly page count tracking
- [ ] Monthly cross-reference review
- [ ] Technical accuracy spot checks
- [ ] Pedagogical flow reviews
-
---
-
-## 📞 Stakeholder Communication
-
-### MIT Press Updates
- **Monthly**: Progress report with page counts
- **Major milestones**: Notify when phases complete
- **Issues**: Immediate communication of risks
-
-### Community/Reviewers
- **End Month 2**: Share V1 draft for review
- **End Month 4**: Share V2 draft chapters for review
- **End Month 5**: Full review cycle
-
---
-
-## ✅ Final Checklist (Month 6)
-
-### Volume 1 Completion
- [ ] All 14 chapters present and coherent
- [ ] Page count: 1,150-1,250 pages
- [ ] All cross-references to V2 marked clearly
- [ ] Exercises and quizzes updated
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared
-
-### Volume 2 Completion
- [ ] All 15 chapters present and coherent
- [ ] Page count: 1,100-1,200 pages
- [ ] Bridge chapter effective
- [ ] New chapters integrate extracted content
- [ ] Exercises and quizzes complete
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared
-
-### Both Volumes
- [ ] Consistent notation across volumes
- [ ] No content duplication
- [ ] Clear prerequisite chain
- [ ] Professional copyedit complete
- [ ] Ready for MIT Press submission
-
---
-
-## 📊 Success Metrics
-
-### Quantitative
- Volume 1: 1,150-1,250 pages ✓/✗
- Volume 2: 1,100-1,200 pages ✓/✗
- New content: 325-375 pages ✓/✗
- Timeline: 6 months ✓/✗
-
-### Qualitative
- Each volume independently valuable ✓/✗
- Clear pedagogical progression ✓/✗
- MIT Press approval ✓/✗
- Reviewer feedback positive ✓/✗
-
---
-
-## 🎓 Post-Completion
-
-### Publication Process
- [ ] Submit to MIT Press
- [ ] Incorporate editorial feedback
- [ ] Final production review
- [ ] Marketing materials
- [ ] Course adoption outreach
-
-### Maintenance
- [ ] Errata tracking system
- [ ] Annual review cycle
- [ ] Community feedback integration
- [ ] Future edition planning
-
---
-
-## 📝 Notes and Decisions
-
-### December 2024 - Project Launch
- Decision: Committed to 6-month timeline
- Decision: Will do full surgery, not quick split
- Decision: Flagship quality is priority over speed
- Next decision needed: [track decisions here]
-
---
-
-**Last Updated**: December 7, 2024
-**Status**: Planning Complete - Ready to Begin Execution
-**Next Action**: Begin Chapter 6 (Data Engineering) surgery - Week 1
-
---
-
-*This roadmap is the master coordination document for the two-volume split project. Update weekly with progress, decisions, and course corrections.*
--- a/book/docs/VOLUME_SPLIT_SURGICAL_PLAN.md
+++ b/book/docs/VOLUME_SPLIT_SURGICAL_PLAN.md
--- a/book/docs/VOLUME_STRUCTURE_PROPOSAL.md
+++ b/book/docs/VOLUME_STRUCTURE_PROPOSAL.md
@@ -1,216 +0,0 @@
-# Machine Learning Systems: Two-Volume Structure
-
-**Proposal for MIT Press**
-*Draft: December 2024*
-
---
-
-## Executive Summary
-
-The *Machine Learning Systems* textbook will be published as two complementary volumes of 14 chapters each:
-
-| Volume | Title | Focus | Chapters |
-|--------|-------|-------|----------|
-| **Volume 1** | Introduction to Machine Learning Systems | Complete ML lifecycle, single-system focus | 14 (all existing) |
-| **Volume 2** | Advanced Machine Learning Systems | Principles of scale, distribution, and production | 14 (6 existing, 8 new) |
-
-**Guiding Philosophy:**
- **Volume 1**: Everything you need to build ML systems on a single machine, ending on a positive note with societal impact
- **Volume 2**: Timeless principles for operating ML systems at scale, grounded in physics and mathematics rather than current technologies
-
---
-
-## Volume 1: Introduction to Machine Learning Systems
-
-*The complete ML lifecycle: understand it, build it, optimize it, deploy it, use it for good.*
-
-| Part | Chapter | Description |
-|------|---------|-------------|
-| **Part I: Systems Foundations** | | *What are ML systems?* |
-| | 1. Introduction | Motivation and scope |
-| | 2. ML Systems | System-level view of machine learning |
-| | 3. Deep Learning Primer | Neural network fundamentals |
-| | 4. DNN Architectures | Modern architecture patterns |
-| **Part II: Design Principles** | | *How do you build ML systems?* |
-| | 5. Workflow | End-to-end ML pipeline design |
-| | 6. Data Engineering | Data collection, processing, validation |
-| | 7. Frameworks | PyTorch, TensorFlow, JAX ecosystem |
-| | 8. Training | Training loops, hyperparameters, convergence |
-| **Part III: Performance Engineering** | | *How do you make ML systems fast?* |
-| | 9. Efficient AI | Efficiency principles and metrics |
-| | 10. Optimizations | Quantization, pruning, distillation |
-| | 11. Hardware Acceleration | GPUs, TPUs, custom accelerators |
-| | 12. Benchmarking | Measurement, MLPerf, evaluation methodology |
-| **Part IV: Practice & Impact** | | *How do you deploy and use ML systems responsibly?* |
-| | 13. ML Operations | Deployment, monitoring, CI/CD for ML |
-| | 14. AI for Good | Positive societal applications |
-
-**Total: 14 chapters across 4 parts (all existing content)**
-
-*Early awareness:* Include a short Sustainable AI note in the Benchmarking or ML Operations chapter to flag energy and carbon impacts without adding another chapter.
-
-### Volume 1 Narrative Arc
-
-The book progresses from understanding → building → optimizing → deploying → impact:
-
-1. **Foundations** establish what ML systems are and why they matter
-2. **Design** teaches how to construct complete pipelines
-3. **Performance** shows how to make systems efficient
-4. **Practice & Impact** completes the lifecycle and ends on an inspirational note
-
-Ending on "AI for Good" leaves students with a positive vision of what they can build.
-
---
-
-## Volume 2: Advanced Machine Learning Systems
-
-*Timeless principles for building and operating ML systems at scale.*
-
-| Part | Chapter | Status | Description |
-|------|---------|--------|-------------|
-| **Part I: Data Movement & Memory** | | | *Moving data is the bottleneck* |
-| | 1. Memory Hierarchies for ML | 🆕 NEW | GPU memory, HBM, activation checkpointing |
-| | 2. Storage Systems for ML | 🆕 NEW | Distributed storage, checkpointing, feature stores |
-| | 3. Communication & Collective Operations | 🆕 NEW | AllReduce, gradient compression, network topology |
-| **Part II: Parallelism & Coordination** | | | *Decomposing computation across machines* |
-| | 4. Distributed Training | 🆕 NEW | Data/model/pipeline/tensor parallelism |
-| | 5. Fault Tolerance & Recovery | 🆕 NEW | Checkpointing, elastic training, failure handling |
-| | 6. Inference Systems | 🆕 NEW | Batching, serving architectures, autoscaling |
-| **Part III: Constrained Environments** | | | *Doing more with less* |
-| | 7. On-device Learning | Existing | Training and adaptation on edge devices |
-| | 8. Edge Deployment | 🆕 NEW | Compilation, runtime optimization, real-time |
-| **Part IV: Adversarial Environments** | | | *Systems under attack and uncertainty* |
-| | 9. Privacy in ML Systems | Existing | Differential privacy, federated learning, secure aggregation |
-| | 10. Security in ML Systems | 🆕 NEW | Supply chain, API security, multi-tenant isolation |
-| | 11. Robust AI | Existing | Adversarial robustness, distribution shift, monitoring |
-| **Part V: Stewardship** | | | *Building systems that serve humanity* |
-| | 12. Responsible AI | Existing | Fairness, accountability, transparency at scale |
-| | 13. Sustainable AI | Existing | Energy efficiency, carbon footprint, environmental impact |
-| | 14. Frontiers & Future Directions | Existing | Emerging paradigms, open problems, conclusion |
-
-**Total: 14 chapters across 5 parts (6 existing, 8 new)**
-
---
-
-## New Content for Volume 2
-
-### Part I: Data Movement & Memory
-*The physics of data movement is the fundamental constraint in modern ML.*
-
-| Chapter | Key Topics | Timeless Principle |
-|---------|------------|-------------------|
-| **Memory Hierarchies for ML** | GPU memory management, HBM architecture, caching strategies, activation checkpointing, memory-efficient attention | Memory bandwidth limits compute utilization |
-| **Storage Systems for ML** | Distributed file systems, checkpoint I/O, feature stores, data lakes, prefetching, I/O scheduling | Storage throughput gates training speed |
-| **Communication & Collective Operations** | AllReduce algorithms, ring/tree topologies, gradient compression, RDMA fundamentals, network topology design | Communication overhead limits scaling |
-
-### Part II: Parallelism & Coordination
-*The mathematics of decomposing work across machines.*
-
-| Chapter | Key Topics | Timeless Principle |
-|---------|------------|-------------------|
-| **Distributed Training** | Data parallelism, model parallelism (tensor, pipeline, expert), hybrid strategies, synchronization, load balancing | Parallelism has fundamental trade-offs |
-| **Fault Tolerance & Recovery** | Checkpoint strategies, async checkpointing, elastic training, failure detection, graceful degradation | Large systems fail; recovery must be designed in |
-| **Inference Systems** | Batching strategies, continuous batching, KV cache management, model serving patterns, autoscaling, SLO management | Serving has different constraints than training |
-
-### Part III: Constrained Environments
-*Operating under resource limitations.*
-
-| Chapter | Key Topics | Timeless Principle |
-|---------|------------|-------------------|
-| **Edge Deployment** | Model compilation, runtime optimization, heterogeneous hardware, real-time constraints, power management | Constraints force creativity |
-
-### Part IV: Adversarial Environments
-*Systems facing attacks, privacy requirements, and uncertainty.*
-
-| Chapter | Key Topics | Timeless Principle |
-|---------|------------|-------------------|
-| **Security in ML Systems** | Model provenance, supply chain security, API protection, multi-tenant isolation, access control | Production systems face adversaries |
-
---
-
-## Design Principles
-
-### Why This Structure Works
-
-**Volume 1 (Single System)**
- Teaches the complete lifecycle
- Everything can be learned and practiced on one machine
- Ends positively with societal impact
-
-**Volume 2 (Distributed Systems)**
- Builds on Volume 1 foundations
- Addresses what changes at scale
- Organized around timeless constraints, not current technologies
-
-### What Makes Volume 2 Timeless
-
-Each part addresses constraints rooted in physics, mathematics, or human nature:
-
-| Part | Eternal Constraint | Foundation |
-|------|-------------------|------------|
-| Data Movement & Memory | Moving data costs more than compute | Physics: speed of light, memory bandwidth |
-| Parallelism & Coordination | Work must be decomposed and synchronized | Mathematics of parallel computation |
-| Constrained Environments | Resources are always finite | Economics and physics |
-| Adversarial Environments | Attackers and uncertainty exist | Human nature, statistics |
-| Stewardship | Technology must serve humanity | Ethics, sustainability |
-
-Chapters use current examples (LLMs, transformers, specific hardware) but frame them as instances of these enduring principles.
-
---
-
-## Content Migration Summary
-
-| Chapter | Volume 1 | Volume 2 | Rationale |
-|---------|----------|----------|-----------|
-| Introduction through Benchmarking | ✓ | | Core technical content |
-| ML Operations | ✓ | | Completes the lifecycle |
-| AI for Good | ✓ | | Positive conclusion |
-| On-device Learning | | ✓ | Edge/constrained is advanced |
-| Privacy & Security | | ✓ | Production security is advanced |
-| Robust AI | | ✓ | Production robustness is advanced |
-| Responsible AI | | ✓ | Scale changes the challenges |
-| Sustainable AI | | ✓ | Datacenter scale is advanced |
-| Frontiers | | ✓ | Conclusion for advanced volume |
-
---
-
-## Audience
-
-| Volume | Primary Audience | Use Cases |
-|--------|-----------------|-----------|
-| Volume 1 | All ML practitioners, undergraduates, bootcamp students | First course in ML systems, self-study |
-| Volume 2 | Infrastructure engineers, graduate students, researchers | Advanced course, reference for practitioners at scale |
-
---
-
-## Collaboration Model
-
-Volume 2's new chapters are candidates for collaborative authorship:
-
-| Topic Area | Ideal Collaborator Profile |
-|------------|---------------------------|
-| Memory & Storage | Datacenter architects, MLPerf Storage contributors |
-| Networking & Communication | Distributed systems researchers, framework developers |
-| Distributed Training | PyTorch/JAX distributed teams, hyperscaler engineers |
-| Fault Tolerance | Site reliability engineers, systems researchers |
-| Inference Systems | ML serving infrastructure engineers |
-| Edge Deployment | Embedded ML practitioners, compiler engineers |
-| Security | ML security researchers, production security engineers |
-
---
-
-## Summary Statistics
-
-| Metric | Volume 1 | Volume 2 |
-|--------|----------|----------|
-| Chapters | 14 | 14 |
-| Parts | 4 | 5 |
-| Existing content | 14 | 6 |
-| New content | 0 | 8 |
-| Focus | Single system | Distributed systems |
-| Prerequisite | None | Volume 1 |
-
---
-
-*Document Version: December 2024*
-*For discussion with MIT Press and potential collaborators*