docs: remove stale volume planning documents

Remove outdated planning documents that have been superseded by
VOLUME_STRUCTURE.md, which now serves as the authoritative reference
for the two-volume textbook organization.

Removed files:
- VOLUME_SPLIT_ROADMAP.md
- VOLUME_SPLIT_SURGICAL_PLAN.md
- VOLUME_STRUCTURE_PROPOSAL.md
- reviewer-feedback-synthesis-r1.md
- volume-outline-draft.md
Author: Vijay Janapa Reddi
Date: 2026-01-03 09:57:16 -05:00
Parent: f3a38e32d5
Commit: ddb8068e6b
5 changed files with 0 additions and 2425 deletions


@@ -1,394 +0,0 @@
# Machine Learning Systems: Two-Volume Split - Master Roadmap
**Project Start Date**: December 2024
**Target Completion**: June 2025 (6 months)
**Goal**: Create flagship two-volume ML Systems textbook series for MIT Press
---
## 📋 Project Documents
### Core Planning Documents
1. **`MIT_PRESS_PROPOSAL.md`** - Official proposal for MIT Press (ready to submit)
2. **`VOLUME_SPLIT_SURGICAL_PLAN.md`** - Section-by-section surgery instructions (50+ pages)
3. **`VOLUME_SPLIT_ROADMAP.md`** - This document: master project plan and progress tracker
4. **`VOLUME_STRUCTURE_PROPOSAL.md`** - Original analysis and proposal
### Supporting Documents
- `VOLUME_SPLIT_ANALYSIS.md` - Deep analysis of pedagogical issues
- `VOLUME_SPLIT_EXECUTIVE_SUMMARY.md` - Quick reference summary
- `DISTRIBUTED_CONTENT_ADDITIONS.md` - Distributed awareness additions for V1
---
## 🎯 Project Vision
**Volume I: Introduction to ML Systems** (~1,150-1,200 pages)
- Complete single-system ML engineering
- Includes distributed awareness (not implementation)
- Target: Undergrads, bootcamps, ML engineers entering the field
**Volume II: Advanced ML Systems** (~1,100-1,150 pages)
- Distributed systems, production scale, responsibility
- Built on timeless principles
- Target: Graduate students, ML infrastructure engineers, senior practitioners
**The One-Liner**:
> "Volume I teaches you to build ML systems that work; Volume II teaches you to build ML systems that scale."
---
## 📊 Current State (December 2024)
### Existing Content
- **Total pages**: 2,172 pages across 22 chapters
- **Status**: Complete draft of comprehensive single-volume book
- **Quality**: Refined through extensive review feedback
### Content Distribution
- **Chapters 1-14**: Form basis of Volume 1 (with surgery)
- **Chapters 15-21**: Move to Volume 2
- **New content needed**: 325-375 pages (8 new chapters for V2)
---
## 🗓️ Six-Month Timeline
### Phase 1: Chapter Surgery (Months 1-2)
**Goal**: Extract distributed content from 7 chapters, create clean V1
#### Month 1: Content Extraction
- **Week 1-2**: Extract distributed content from Chapters 6, 7, 8
- [ ] Chapter 6 (Data Engineering): Extract 40 pages of distributed content
- [ ] Chapter 7 (Frameworks): Extract 50 pages of distributed execution
- [ ] Chapter 8 (Training): Extract 60 pages of distributed training
- **Week 3-4**: Extract from Chapters 10, 11, 12, 13
- [ ] Chapter 10 (Optimizations): Extract 80 pages (NAS, AutoML)
- [ ] Chapter 11 (Hardware): Extract 50 pages (multi-chip)
- [ ] Chapter 12 (Benchmarking): Extract 40 pages (distributed benchmarking)
- [ ] Chapter 13 (MLOps): Extract 30 pages (production scale)
#### Month 2: Bridging and Polish
- **Week 5-6**: Create V1 transitions
- [ ] Add "See Volume 2" callout boxes in V1
- [ ] Write brief distributed awareness sections
- [ ] Ensure V1 chapters remain coherent
- **Week 7-8**: Organize extracted content
- [ ] Create V2 chapter structure
- [ ] Place extracted content in appropriate V2 chapters
- [ ] Identify gaps in extracted content
### Phase 2: New Content Development (Months 3-5)
#### Month 3: Priority 1 Chapters (Essential Infrastructure)
- **Week 9-10**: Memory & Storage
- [ ] V2 Ch1: Memory Hierarchies for ML (45 pages)
- [ ] V2 Ch2: Storage Systems for ML (40 pages)
- **Week 11-12**: Communication & Distributed Training
- [ ] V2 Ch3: Communication & Collective Operations (45 pages)
- [ ] V2 Ch4: Distributed Training Systems (50 pages) - integrate extracted content
#### Month 4: Priority 2 Chapters (Production Requirements)
- **Week 13-14**: Fault Tolerance & Inference
- [ ] V2 Ch5: Fault Tolerance & Resilience (40 pages)
- [ ] V2 Ch6: Inference at Scale (45 pages)
- **Week 15-16**: Integration
- [ ] Integrate all extracted content into new chapters
- [ ] Write chapter introductions and conclusions
- [ ] Create cross-references
#### Month 5: Priority 3 Chapters (Specialized Topics)
- **Week 17-18**: Edge Systems
- [ ] V2 Ch8: Edge Intelligence Systems (50 pages)
- [ ] Integrate extracted edge content from Ch2
- **Week 19-20**: Final new chapters
- [ ] V2 Ch1: Bridge Chapter (30 pages) - From Single to Distributed
- [ ] Update existing V2 chapters (15-20) with introductions
### Phase 3: Integration and Polish (Month 6)
#### Month 6: Final Integration
- **Week 21-22**: Cross-References and Consistency
- [ ] Update all V1→V2 cross-references
- [ ] Update all V2→V1 prerequisite references
- [ ] Ensure consistent notation across volumes
- [ ] Verify all figure references work
- **Week 23**: Narrative Flow
- [ ] Review V1 narrative arc (Foundations → Building → Optimizing → Impact)
- [ ] Review V2 narrative arc (Scale → Production → Responsibility)
- [ ] Polish chapter transitions
- [ ] Write volume prefaces
- **Week 24**: Final Quality Checks
- [ ] Technical accuracy review
- [ ] Page count verification
- [ ] Exercise and quiz consistency
- [ ] Final copyedit pass
- [ ] Prepare camera-ready manuscripts
---
## 📈 Progress Tracking
### Volume 1 Progress (Target: 1,150-1,200 pages)
| Chapter | Current | Target | Surgery Status | Notes |
|---------|---------|--------|----------------|-------|
| 1. Introduction | 90 | 60 | ⬜ Not Started | Compress history section |
| 2. ML Systems | 70 | 70 | ⬜ Not Started | Extract hybrid architectures |
| 3. DL Primer | 110 | 100 | ⬜ Not Started | No surgery (keep as-is) |
| 4. DNN Architectures | 82 | 100 | ⬜ Not Started | No surgery (keep as-is) |
| 5. Workflow | 51 | 40 | ⬜ Not Started | Minor compression |
| 6. Data Engineering | 138 | 80 | ⬜ Not Started | Extract distributed storage |
| 7. Frameworks | 121 | 100 | ⬜ Not Started | Extract distributed execution |
| 8. Training | 157 | 100 | ⬜ Not Started | Extract distributed training |
| 9. Efficient AI | 52 | 60 | ⬜ Not Started | No surgery (keep as-is) |
| 10. Optimizations | 160 | 120 | ⬜ Not Started | Extract NAS, AutoML |
| 11. Hardware | 181 | 90 | ⬜ Not Started | Extract multi-chip |
| 12. Benchmarking | 124 | 80 | ⬜ Not Started | Extract distributed benchmarking |
| 13. MLOps | 126 | 50 | ⬜ Not Started | Extract production scale |
| 14. AI for Good | 84 | 50 | ⬜ Not Started | Minor compression |
| **TOTAL** | **1,546** | **1,100** | **0%** | **V1 baseline complete** |
Progress Key: ⬜ Not Started | 🟨 In Progress | ✅ Complete
### Volume 2 Progress (Target: 1,100-1,150 pages)
| Chapter | Source | Target | Status | Notes |
|---------|--------|--------|--------|-------|
| 1. Bridge: Single to Distributed | NEW | 30 | ⬜ Not Started | Write from scratch |
| 2. Memory Hierarchies | NEW + Ch11 | 45 | ⬜ Not Started | New content + extracts |
| 3. Storage Systems | NEW + Ch6 | 40 | ⬜ Not Started | New content + extracts |
| 4. Communication & Collectives | NEW + Ch8 | 45 | ⬜ Not Started | New content + extracts |
| 5. Distributed Training | Ch8 + Ch10 | 50 | ⬜ Not Started | Consolidate extracts |
| 6. Fault Tolerance | NEW + Ch13 | 40 | ⬜ Not Started | New content + extracts |
| 7. Inference at Scale | NEW + Ch2 + Ch13 | 45 | ⬜ Not Started | New content + extracts |
| 8. Edge Intelligence | NEW + Ch2 | 50 | ⬜ Not Started | New content + extracts |
| 9. On-Device Learning | Ch14 (existing) | 127 | ⬜ Not Started | Move from V1 |
| 10. Privacy Systems | Ch15 (split) | 65 | ⬜ Not Started | Split Privacy/Security |
| 11. Security Systems | Ch15 (split) | 68 | ⬜ Not Started | Split Privacy/Security |
| 12. Robust AI | Ch16 (existing) | 137 | ⬜ Not Started | Move from V1 |
| 13. Responsible AI | Ch17 (existing) | 135 | ⬜ Not Started | Move from V1 |
| 14. Sustainable AI | Ch18 (existing) | 46 | ⬜ Not Started | Move from V1 |
| 15. Frontiers & AGI | Ch19+20 (merge) | 78 | ⬜ Not Started | Merge two chapters |
| **TOTAL** | **Mixed** | **1,001** | **0%** | **Need ~100 more pages** |
### New Content Writing Progress (325-375 pages needed)
| Chapter | Pages | Draft | Review | Final | Notes |
|---------|-------|-------|--------|-------|-------|
| Bridge Chapter | 30 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Memory Hierarchies | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Storage Systems | 40 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Communication | 45 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Distributed Training | 50 | ⬜ | ⬜ | ⬜ | Priority 1 (+ extracts) |
| Fault Tolerance | 40 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Inference at Scale | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Edge Intelligence | 50 | ⬜ | ⬜ | ⬜ | Priority 3 |
| **TOTAL NEW** | **345** | **0%** | **0%** | **0%** | 8 new chapters |
---
## 📝 Weekly Progress Log
### Week 1 (Dec 2-8, 2024)
- [ ] Created comprehensive surgical plan
- [ ] Created MIT Press proposal
- [ ] Created master roadmap
- [ ] **Next**: Begin Chapter 6 surgery
### Week 2 (Dec 9-15, 2024)
- [ ] Progress notes...
### Week 3 (Dec 16-22, 2024)
- [ ] Progress notes...
*(Continue weekly log throughout 6-month period)*
---
## 🎯 Key Milestones
- [ ] **End Month 1**: All distributed content extracted from V1 chapters
- [ ] **End Month 2**: V1 chapters coherent and complete
- [ ] **End Month 3**: Priority 1 new chapters drafted (160 pages)
- [ ] **End Month 4**: Priority 2 new chapters drafted (85 pages)
- [ ] **End Month 5**: All new content drafted (345 pages)
- [ ] **End Month 6**: Camera-ready manuscripts for both volumes
---
## ⚠️ Risks and Mitigation
### Risk 1: Page Count Imbalance
**Risk**: V1 or V2 ends up significantly larger/smaller than target
**Mitigation**:
- Monitor page counts weekly
- Adjust compression/expansion as needed
- Maintain flexible targets (±100 pages)
### Risk 2: Missing Dependencies
**Risk**: V2 assumes V1 knowledge not actually covered
**Mitigation**:
- Create prerequisite matrix
- Add recap sections to V2 chapters
- Review cross-references monthly
### Risk 3: Timeline Slippage
**Risk**: New chapter writing takes longer than estimated
**Mitigation**:
- Prioritize essential chapters first
- Have backup plan to defer Priority 3 chapters
- Build 2-week buffer into timeline
### Risk 4: Content Duplication
**Risk**: Same concept explained in both volumes
**Mitigation**:
- Clear "basic vs. advanced" delineation
- V2 references V1 explicitly
- Review for overlap in Month 6
---
## 📚 Reference Materials
### Pedagogical Framework
- **V1 Narrative**: Foundations → Building → Optimizing → Impact
- **V2 Narrative**: Scale → Production → Responsibility
- **Connection**: V1 ends with inspiration, V2 begins with a bridge chapter
### Chapter Surgery Guidelines
- **Single-machine boundary**: Keep in V1
- **Distributed systems**: Move to V2
- **Production scale**: Move to V2
- **Advanced optimization**: Move to V2
### Writing Standards
- Timeless principles over current tech
- Every chapter has: Purpose, Learning Outcomes, Summary
- Concrete examples throughout
- "Fallacies and Pitfalls" section
---
## 🔧 Tools and Workflow
### Version Control
- [ ] Create `volume-split` branch in Git
- [ ] Track all changes in branch
- [ ] Regular commits with clear messages
### Organization
- `book/volume1/` - Volume 1 chapters
- `book/volume2/` - Volume 2 chapters
- `book/docs/` - All planning documents
- `book/extracted/` - Content extracted from V1
### Quality Checks
- [ ] Weekly page count tracking
- [ ] Monthly cross-reference review
- [ ] Technical accuracy spot checks
- [ ] Pedagogical flow reviews
---
## 📞 Stakeholder Communication
### MIT Press Updates
- **Monthly**: Progress report with page counts
- **Major milestones**: Notify when phases complete
- **Issues**: Immediate communication of risks
### Community/Reviewers
- **End Month 2**: Share V1 draft for review
- **End Month 4**: Share V2 draft chapters for review
- **End Month 5**: Full review cycle
---
## ✅ Final Checklist (Month 6)
### Volume 1 Completion
- [ ] All 14 chapters present and coherent
- [ ] Page count: 1,150-1,250 pages
- [ ] All cross-references to V2 marked clearly
- [ ] Exercises and quizzes updated
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared
### Volume 2 Completion
- [ ] All 15 chapters present and coherent
- [ ] Page count: 1,100-1,200 pages
- [ ] Bridge chapter effective
- [ ] New chapters integrate extracted content
- [ ] Exercises and quizzes complete
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared
### Both Volumes
- [ ] Consistent notation across volumes
- [ ] No content duplication
- [ ] Clear prerequisite chain
- [ ] Professional copyedit complete
- [ ] Ready for MIT Press submission
---
## 📊 Success Metrics
### Quantitative
- Volume 1: 1,150-1,250 pages ✓/✗
- Volume 2: 1,100-1,200 pages ✓/✗
- New content: 325-375 pages ✓/✗
- Timeline: 6 months ✓/✗
### Qualitative
- Each volume independently valuable ✓/✗
- Clear pedagogical progression ✓/✗
- MIT Press approval ✓/✗
- Reviewer feedback positive ✓/✗
---
## 🎓 Post-Completion
### Publication Process
- [ ] Submit to MIT Press
- [ ] Incorporate editorial feedback
- [ ] Final production review
- [ ] Marketing materials
- [ ] Course adoption outreach
### Maintenance
- [ ] Errata tracking system
- [ ] Annual review cycle
- [ ] Community feedback integration
- [ ] Future edition planning
---
## 📝 Notes and Decisions
### December 2024 - Project Launch
- Decision: Committed to 6-month timeline
- Decision: Will do full surgery, not quick split
- Decision: Flagship quality is priority over speed
- Next decision needed: [track decisions here]
---
**Last Updated**: December 7, 2024
**Status**: Planning Complete - Ready to Begin Execution
**Next Action**: Begin Chapter 6 (Data Engineering) surgery - Week 1
---
*This roadmap is the master coordination document for the two-volume split project. Update weekly with progress, decisions, and course corrections.*

File diff suppressed because it is too large.


@@ -1,216 +0,0 @@
# Machine Learning Systems: Two-Volume Structure
**Proposal for MIT Press**
*Draft: December 2024*
---
## Executive Summary
The *Machine Learning Systems* textbook will be published as two complementary volumes of 14 chapters each:
| Volume | Title | Focus | Chapters |
|--------|-------|-------|----------|
| **Volume 1** | Introduction to Machine Learning Systems | Complete ML lifecycle, single-system focus | 14 (all existing) |
| **Volume 2** | Advanced Machine Learning Systems | Principles of scale, distribution, and production | 14 (6 existing, 8 new) |
**Guiding Philosophy:**
- **Volume 1**: Everything you need to build ML systems on a single machine, ending on a positive note with societal impact
- **Volume 2**: Timeless principles for operating ML systems at scale, grounded in physics and mathematics rather than current technologies
---
## Volume 1: Introduction to Machine Learning Systems
*The complete ML lifecycle: understand it, build it, optimize it, deploy it, use it for good.*
| Part | Chapter | Description |
|------|---------|-------------|
| **Part I: Systems Foundations** | | *What are ML systems?* |
| | 1. Introduction | Motivation and scope |
| | 2. ML Systems | System-level view of machine learning |
| | 3. Deep Learning Primer | Neural network fundamentals |
| | 4. DNN Architectures | Modern architecture patterns |
| **Part II: Design Principles** | | *How do you build ML systems?* |
| | 5. Workflow | End-to-end ML pipeline design |
| | 6. Data Engineering | Data collection, processing, validation |
| | 7. Frameworks | PyTorch, TensorFlow, JAX ecosystem |
| | 8. Training | Training loops, hyperparameters, convergence |
| **Part III: Performance Engineering** | | *How do you make ML systems fast?* |
| | 9. Efficient AI | Efficiency principles and metrics |
| | 10. Optimizations | Quantization, pruning, distillation |
| | 11. Hardware Acceleration | GPUs, TPUs, custom accelerators |
| | 12. Benchmarking | Measurement, MLPerf, evaluation methodology |
| **Part IV: Practice & Impact** | | *How do you deploy and use ML systems responsibly?* |
| | 13. ML Operations | Deployment, monitoring, CI/CD for ML |
| | 14. AI for Good | Positive societal applications |
**Total: 14 chapters across 4 parts (all existing content)**
*Early awareness:* include a short Sustainable AI note in Benchmarking or ML Operations to flag energy and carbon impacts without adding another chapter.
### Volume 1 Narrative Arc
The book progresses from understanding → building → optimizing → deploying → impact:
1. **Foundations** establish what ML systems are and why they matter
2. **Design** teaches how to construct complete pipelines
3. **Performance** shows how to make systems efficient
4. **Practice & Impact** completes the lifecycle and ends on an inspirational note
Ending on "AI for Good" leaves students with a positive vision of what they can build.
---
## Volume 2: Advanced Machine Learning Systems
*Timeless principles for building and operating ML systems at scale.*
| Part | Chapter | Status | Description |
|------|---------|--------|-------------|
| **Part I: Data Movement & Memory** | | | *Moving data is the bottleneck* |
| | 1. Memory Hierarchies for ML | 🆕 NEW | GPU memory, HBM, activation checkpointing |
| | 2. Storage Systems for ML | 🆕 NEW | Distributed storage, checkpointing, feature stores |
| | 3. Communication & Collective Operations | 🆕 NEW | AllReduce, gradient compression, network topology |
| **Part II: Parallelism & Coordination** | | | *Decomposing computation across machines* |
| | 4. Distributed Training | 🆕 NEW | Data/model/pipeline/tensor parallelism |
| | 5. Fault Tolerance & Recovery | 🆕 NEW | Checkpointing, elastic training, failure handling |
| | 6. Inference Systems | 🆕 NEW | Batching, serving architectures, autoscaling |
| **Part III: Constrained Environments** | | | *Doing more with less* |
| | 7. On-device Learning | Existing | Training and adaptation on edge devices |
| | 8. Edge Deployment | 🆕 NEW | Compilation, runtime optimization, real-time |
| **Part IV: Adversarial Environments** | | | *Systems under attack and uncertainty* |
| | 9. Privacy in ML Systems | Existing | Differential privacy, federated learning, secure aggregation |
| | 10. Security in ML Systems | 🆕 NEW | Supply chain, API security, multi-tenant isolation |
| | 11. Robust AI | Existing | Adversarial robustness, distribution shift, monitoring |
| **Part V: Stewardship** | | | *Building systems that serve humanity* |
| | 12. Responsible AI | Existing | Fairness, accountability, transparency at scale |
| | 13. Sustainable AI | Existing | Energy efficiency, carbon footprint, environmental impact |
| | 14. Frontiers & Future Directions | Existing | Emerging paradigms, open problems, conclusion |
**Total: 14 chapters across 5 parts (6 existing, 8 new)**
---
## New Content for Volume 2
### Part I: Data Movement & Memory
*The physics of data movement is the fundamental constraint in modern ML.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Memory Hierarchies for ML** | GPU memory management, HBM architecture, caching strategies, activation checkpointing, memory-efficient attention | Memory bandwidth limits compute utilization |
| **Storage Systems for ML** | Distributed file systems, checkpoint I/O, feature stores, data lakes, prefetching, I/O scheduling | Storage throughput gates training speed |
| **Communication & Collective Operations** | AllReduce algorithms, ring/tree topologies, gradient compression, RDMA fundamentals, network topology design | Communication overhead limits scaling |
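The AllReduce row above can be made concrete with a toy simulation. The sketch below is illustrative only (plain Python, one number per chunk) and is not drawn from the chapter drafts; it shows the two phases of ring AllReduce and why each node transmits roughly 2(N-1)/N of the data volume, nearly independent of cluster size N.

```python
import copy

def ring_allreduce(node_data):
    """Simulate ring AllReduce over n nodes.

    node_data: list of n lists, each with n entries (one entry stands in
    for one chunk of the gradient). Returns what each node ends up holding:
    the elementwise sum across all nodes, replicated everywhere.
    """
    n = len(node_data)
    data = copy.deepcopy(node_data)
    # Phase 1: reduce-scatter. After n-1 steps, node i holds the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        snapshot = copy.deepcopy(data)  # all sends happen "simultaneously"
        for i in range(n):
            c = (i - step) % n          # chunk node i forwards this step
            data[(i + 1) % n][c] += snapshot[i][c]
    # Phase 2: all-gather. Each node forwards its fully reduced chunk
    # around the ring until every node has every chunk.
    for step in range(n - 1):
        snapshot = copy.deepcopy(data)
        for i in range(n):
            c = (i + 1 - step) % n      # fully reduced chunk to forward
            data[(i + 1) % n][c] = snapshot[i][c]
    return data
```

Each node sends 2(n-1) chunks in total, each 1/n of the data, which is the bandwidth-optimality argument the chapter would develop.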
### Part II: Parallelism & Coordination
*The mathematics of decomposing work across machines.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Distributed Training** | Data parallelism, model parallelism (tensor, pipeline, expert), hybrid strategies, synchronization, load balancing | Parallelism has fundamental trade-offs |
| **Fault Tolerance & Recovery** | Checkpoint strategies, async checkpointing, elastic training, failure detection, graceful degradation | Large systems fail; recovery must be designed in |
| **Inference Systems** | Batching strategies, continuous batching, KV cache management, model serving patterns, autoscaling, SLO management | Serving has different constraints than training |
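As a minimal illustration of the data-parallelism row above (a hedged sketch, not text from the chapter): each replica computes a gradient on its own data shard, the gradients are averaged (the job AllReduce performs in practice), and every replica applies the identical update, keeping the weights synchronized.

```python
def data_parallel_step(weights, shard_grads, lr=0.1):
    """One synchronous data-parallel SGD step.

    weights: current parameter vector (shared by all replicas).
    shard_grads: one gradient vector per worker, computed on its shard.
    Returns the updated weights every replica ends up with.
    """
    n = len(shard_grads)
    # Average the per-shard gradients (what AllReduce computes at scale).
    avg = [sum(g[j] for g in shard_grads) / n for j in range(len(weights))]
    # Identical update on every replica keeps weights in sync.
    return [w - lr * gj for w, gj in zip(weights, avg)]
```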
### Part III: Constrained Environments
*Operating under resource limitations.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Edge Deployment** | Model compilation, runtime optimization, heterogeneous hardware, real-time constraints, power management | Constraints force creativity |
### Part IV: Adversarial Environments
*Systems facing attacks, privacy requirements, and uncertainty.*
| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Security in ML Systems** | Model provenance, supply chain security, API protection, multi-tenant isolation, access control | Production systems face adversaries |
---
## Design Principles
### Why This Structure Works
**Volume 1 (Single System)**
- Teaches the complete lifecycle
- Everything can be learned and practiced on one machine
- Ends positively with societal impact
**Volume 2 (Distributed Systems)**
- Builds on Volume 1 foundations
- Addresses what changes at scale
- Organized around timeless constraints, not current technologies
### What Makes Volume 2 Timeless
Each part addresses constraints rooted in physics, mathematics, or human nature:
| Part | Eternal Constraint | Foundation |
|------|-------------------|------------|
| Data Movement & Memory | Moving data costs more than compute | Physics: speed of light, memory bandwidth |
| Parallelism & Coordination | Work must be decomposed and synchronized | Mathematics of parallel computation |
| Constrained Environments | Resources are always finite | Economics and physics |
| Adversarial Environments | Attackers and uncertainty exist | Human nature, statistics |
| Stewardship | Technology must serve humanity | Ethics, sustainability |
Chapters use current examples (LLMs, transformers, specific hardware) but frame them as instances of these enduring principles.
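The "moving data costs more than compute" constraint can be quantified with the roofline model. The sketch below uses hypothetical hardware numbers (100 TFLOP/s peak compute, 2 TB/s memory bandwidth), chosen for illustration rather than taken from any vendor specification.

```python
def attainable_tflops(flops, bytes_moved, peak_tflops=100.0, mem_bw_tbps=2.0):
    """Roofline model: throughput is capped either by peak compute or by
    arithmetic intensity (FLOPs per byte moved) times memory bandwidth."""
    intensity = flops / bytes_moved        # FLOPs per byte
    return min(peak_tflops, intensity * mem_bw_tbps)

# Elementwise add of two fp32 values: 1 FLOP per 12 bytes moved
# (read 4 + 4, write 4), so it is severely memory-bound.
elementwise = attainable_tflops(flops=1, bytes_moved=12)

# A large matrix multiply reuses each byte many times (high intensity),
# so it can reach the compute roof instead.
gemm = attainable_tflops(flops=2_000, bytes_moved=1)
```

On these assumed numbers, the elementwise kernel attains about 0.17 TFLOP/s against a 100 TFLOP/s peak, which is the bandwidth-wall argument the part is organized around.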
---
## Content Migration Summary
| Chapter | Volume 1 | Volume 2 | Rationale |
|---------|----------|----------|-----------|
| Introduction through Benchmarking | ✓ | | Core technical content |
| ML Operations | ✓ | | Completes the lifecycle |
| AI for Good | ✓ | | Positive conclusion |
| On-device Learning | | ✓ | Edge/constrained is advanced |
| Privacy & Security | | ✓ | Production security is advanced |
| Robust AI | | ✓ | Production robustness is advanced |
| Responsible AI | | ✓ | Scale changes the challenges |
| Sustainable AI | | ✓ | Datacenter scale is advanced |
| Frontiers | | ✓ | Conclusion for advanced volume |
---
## Audience
| Volume | Primary Audience | Use Cases |
|--------|-----------------|-----------|
| Volume 1 | All ML practitioners, undergraduates, bootcamp students | First course in ML systems, self-study |
| Volume 2 | Infrastructure engineers, graduate students, researchers | Advanced course, reference for practitioners at scale |
---
## Collaboration Model
Volume 2's new chapters are candidates for collaborative authorship:
| Topic Area | Ideal Collaborator Profile |
|------------|---------------------------|
| Memory & Storage | Datacenter architects, MLPerf Storage contributors |
| Networking & Communication | Distributed systems researchers, framework developers |
| Distributed Training | PyTorch/JAX distributed teams, hyperscaler engineers |
| Fault Tolerance | Site reliability engineers, systems researchers |
| Inference Systems | ML serving infrastructure engineers |
| Edge Deployment | Embedded ML practitioners, compiler engineers |
| Security | ML security researchers, production security engineers |
---
## Summary Statistics
| Metric | Volume 1 | Volume 2 |
|--------|----------|----------|
| Chapters | 14 | 14 |
| Parts | 4 | 5 |
| Existing content | 14 | 6 |
| New content | 0 | 8 |
| Focus | Single system | Distributed systems |
| Prerequisite | None | Volume 1 |
---
*Document Version: December 2024*
*For discussion with MIT Press and potential collaborators*


@@ -1,175 +0,0 @@
# Round 1 Reviewer Feedback Synthesis
## Reviewers
- **David Patterson** - Computer architecture, textbook author
- **Ion Stoica** - Distributed systems (Ray, Spark), Berkeley
- **Vijay Reddi** - TinyML, MLPerf, Harvard
- **Jeff Dean** - Google Senior Fellow, large-scale systems
---
## Consensus Issues (All/Most Reviewers Agree)
### 1. Volume I is Incomplete Without Some Production/Scale Awareness
**Patterson**: "Cannot teach deployment without responsibility integration"
**Stoica**: "Data parallelism is now foundational, not advanced"
**Reddi**: "Edge deployment is fundamental, not advanced"
**Dean**: "Scale thinking should be woven throughout Vol I, not deferred"
**Consensus**: Vol I currently produces graduates who lack awareness of:
- Distributed training basics (Stoica, Dean)
- Resource constraints/edge deployment (Reddi)
- Responsible practices (Patterson)
- Cost and production realities (Dean)
### 2. Chapter 14 "Preview" Approach is Problematic
**Patterson**: "Pedagogically misguided - responsibility should be integrated throughout"
**All others**: Generally agree preview is insufficient
**Consensus**: The "preview" approach treats important topics as afterthoughts.
### 3. Volume II Part I/II Ordering Needs Work
**Stoica**: "Teaching infrastructure before algorithms is pedagogically backwards"
**Dean**: Agrees infra should come with context
**Consensus**: Teach distributed algorithms first, then infrastructure that supports them.
### 4. Missing Hands-On/Practical Content
**Patterson**: "No mention of labs or programming assignments"
**Reddi**: "Students cannot deploy to microcontroller after Vol I"
**Dean**: "Missing debugging and profiling skills"
**Consensus**: Both volumes need explicit practical components.
---
## Key Disagreements/Different Emphases
### What Should Move to Volume I?
| Topic | Patterson | Stoica | Reddi | Dean |
|-------|-----------|--------|-------|------|
| Edge/TinyML deployment | Maybe | - | **CRITICAL** | - |
| Data parallelism basics | - | **CRITICAL** | - | Important |
| Checkpointing | - | **CRITICAL** | - | Important |
| Responsible AI integration | **CRITICAL** | - | - | - |
| Cost awareness | - | - | - | **CRITICAL** |
**Tension**: Each reviewer wants their specialty area elevated in Vol I. Cannot add everything without making Vol I too large.
### Chapter Count
- **Patterson**: 14 chapters OK if balanced, concerned about page counts
- **Reddi**: Suggests 15 chapters (add edge deployment)
- **Stoica/Dean**: Focus less on count, more on content depth
---
## Specific Recommendations by Volume
### Volume I Additions (Ranked by Consensus)
| Priority | Addition | Supporters |
|----------|----------|------------|
| HIGH | Data parallelism basics in Ch 8 (Training) | Stoica, Dean |
| HIGH | Checkpointing basics in Ch 8 | Stoica, Dean |
| HIGH | Resource-constrained deployment chapter | Reddi (strong), Patterson (partial) |
| HIGH | Cost/efficiency awareness throughout | Dean |
| MEDIUM | Integrate responsibility throughout (not just Ch 14) | Patterson |
| MEDIUM | Expand quantization/pruning depth (Ch 10) | Reddi |
| MEDIUM | Strengthen benchmarking rigor (Ch 12) | Reddi |
### Volume II Restructuring
| Priority | Change | Supporters |
|----------|--------|------------|
| HIGH | Reorder Parts I/II (algorithms before infrastructure) | Stoica, Dean |
| HIGH | Add distributed systems theory basics | Stoica |
| MEDIUM | Add production debugging chapter | Dean, Stoica |
| MEDIUM | Expand MLOps chapter significantly | Dean |
| MEDIUM | Add cost/resource management | Dean |
---
## The Central Dilemma
**Cannot add everything without making volumes too large.**
Options:
### Option A: Minimal Vol I Changes (Original + Small Additions)
- Keep current 14-chapter structure
- Add data parallelism section to Ch 8
- Add checkpointing section to Ch 8
- Strengthen Ch 14 preview (but keep as preview)
- Vol II restructures Part I/II ordering
**Pro**: Minimal disruption, faster to implement
**Con**: Patterson and Reddi concerns not fully addressed
### Option B: Add Edge Deployment to Vol I (Reddi's Recommendation)
- Add new Chapter 12: "Resource-Constrained Deployment"
- Renumber remaining chapters (15 total)
- Expand quantization/pruning depth
- Vol II restructures Part I/II
**Pro**: Addresses critical industry need (mobile/embedded)
**Con**: Makes Vol I larger, may be too ambitious
### Option C: Integrate Responsibility Throughout Vol I (Patterson's Recommendation)
- Distribute responsible systems content across chapters
- Remove standalone Ch 14 preview
- Add fairness to Ch 6 (Data), security to Ch 10, sustainability to Ch 9
- Keep 14 chapters but redistribute content
**Pro**: Pedagogically sounder integration
**Con**: Significant rewrite of multiple chapters
### Option D: Hybrid - Core Additions Only
- Add data parallelism + checkpointing to Ch 8 (Stoica/Dean consensus)
- Add brief edge deployment section to Ch 13 (MLOps) not new chapter
- Keep Ch 14 but strengthen it with integrated callouts in earlier chapters
- Vol II restructures Part I/II
**Pro**: Addresses highest-consensus items without major restructure
**Con**: Doesn't fully satisfy any single reviewer
---
## Recommendation for User Decision
**Suggested path forward**: Option D (Hybrid) for Vol I structure, with Vol II restructuring.
**Rationale**:
1. Stoica and Dean (industry leaders in distributed systems) agree on data parallelism/checkpointing - this is the highest consensus item
2. Full edge deployment chapter (Reddi) is valuable but may be too ambitious for immediate restructure
3. Full responsibility integration (Patterson) is pedagogically ideal but requires significant rewrite
4. Vol II restructuring (algorithms before infrastructure) has clear consensus
**What this means for your current draft**:
**Volume I Changes**:
- Ch 8 (Training): Add "Distributed Training Fundamentals" section
- Ch 8 (Training): Add "Checkpointing" section
- Ch 9 (Efficiency): Add brief energy/sustainability measurement
- Ch 14 (Preview): Strengthen, add forward references throughout earlier chapters
**Volume II Changes**:
- Reorder: Distributed Training → Communication → Fault Tolerance → THEN Infrastructure
- Add brief theory section to Ch 1
---
## Questions for User
1. **Edge deployment priority**: Is adding a full edge deployment chapter to Vol I worth the extra scope? (Reddi makes a strong case for industry relevance)
2. **Responsibility integration**: Should we integrate responsible AI throughout Vol I chapters (Patterson's strong recommendation), or keep the preview approach?
3. **Page count targets**: Do you have MIT Press guidance on target page counts? This affects how much we can add.
4. **Volume II priority**: Is restructuring Vol II Part I/II ordering acceptable, or is that structure already locked?


@@ -1,207 +0,0 @@
# Machine Learning Systems: Two-Volume Structure
**Status**: Approved (Round 2 Review Complete)
**Target Publisher**: MIT Press
**Audience**: Undergraduate and graduate CS/ECE students, academic courses
---
## Overview
This textbook is being split into two volumes to serve different learning objectives:
- **Volume I: Introduction to ML Systems** - Foundational knowledge for building, optimizing, and deploying ML systems
- **Volume II: Advanced ML Systems** - Scale, distributed systems, production hardening, and responsible deployment
Each volume should stand alone as a complete learning experience while together forming a comprehensive treatment of the field.
---
## Volume I: Introduction to ML Systems
### Goal
A reader completes Volume I and can competently build, optimize, and deploy ML systems with awareness of responsible practices.
### Target Audience
- Upper-level undergraduates
- Early graduate students
- Practitioners transitioning into ML systems
### Course Mapping
- Single semester "Introduction to ML Systems" course
- Foundation for more advanced distributed systems or MLOps courses
### Structure (14 chapters)
#### Part I: Foundations
Establish the conceptual framework for understanding ML as a systems discipline.
| Ch | Title | Purpose |
|----|-------|---------|
| 1 | Introduction to Machine Learning Systems | Why ML systems thinking matters |
| 2 | The ML Systems Landscape | Survey of the field, key components |
| 3 | Deep Learning Foundations | Mathematical and conceptual foundations |
| 4 | Modern Neural Architectures | CNNs, RNNs, Transformers, architectural choices |
#### Part II: Development
Practical skills for constructing ML systems from data to trained model.
| Ch | Title | Purpose |
|----|-------|---------|
| 5 | ML Development Workflow | End-to-end process, experimentation |
| 6 | Data Engineering for ML | Pipelines, preprocessing, data quality |
| 7 | ML Frameworks and Tools | PyTorch, TensorFlow, ecosystem |
| 8 | Training Systems | Training loops, distributed basics, debugging |
#### Part III: Optimization
Techniques for making ML systems efficient and fast.
| Ch | Title | Purpose |
|----|-------|---------|
| 9 | Efficiency in AI Systems | Why efficiency matters, metrics |
| 10 | Model Optimization Techniques | Quantization, pruning, distillation |
| 11 | Hardware Acceleration | GPUs, TPUs, custom accelerators |
| 12 | Benchmarking and Evaluation | Measuring performance, MLPerf |
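As a taste of the material Ch 10 covers, a toy version of post-training quantization — mapping float weights to 8-bit integers and back. This is a generic affine-quantization sketch for illustration, not an excerpt from the draft:

```python
def quantize(weights, num_bits=8):
    """Affine (asymmetric) quantization: floats -> ints in [0, 2^b - 1] plus (scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # avoid divide-by-zero for constant weights
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats; the gap to the originals is the quantization error."""
    return [(qi - zero_point) * scale for qi in q]

w = [-1.2, 0.0, 0.4, 2.3]
q, scale, zp = quantize(w)
w_hat = dequantize(q, scale, zp)
```

The per-weight error is bounded by roughly one quantization step (`scale`), which is the trade-off the chapter would quantify against the 4x memory savings of int8 over float32.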
#### Part IV: Operations
Getting models into production responsibly.
| Ch | Title | Purpose |
|----|-------|---------|
| 13 | ML Operations Fundamentals | Deployment, monitoring, CI/CD for ML |
| 14 | Responsible Systems Preview | Brief intro to robustness, security, fairness, sustainability (preview of Vol II topics) |
#### Volume I Conclusion
Synthesis, what was learned, bridge to Volume II.
---
## Volume II: Advanced ML Systems
### Goal
A reader completes Volume II understanding how to build and operate ML systems at scale, with production resilience and responsible practices.
### Target Audience
- Graduate students
- Industry practitioners
- Researchers building large-scale systems
### Prerequisites
- Volume I or equivalent knowledge
- Basic distributed systems concepts helpful
### Course Mapping
- Graduate seminar on large-scale ML systems
- Advanced MLOps course
- Research group reading material
### Structure (16 chapters)
#### Part I: Foundations of Scale
Infrastructure and concepts for scaling beyond single machines.
| Ch | Title | Purpose |
|----|-------|---------|
| 1 | From Single Systems to Planetary Scale | Motivation, challenges of scale |
| 2 | Infrastructure for Large-Scale ML | Clusters, cloud, resource management |
| 3 | Storage Systems for ML | Data lakes, distributed storage, checkpointing |
| 4 | Communication and Collective Operations | AllReduce, parameter servers, network topology |
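Ch 4's central primitive can be stated in a few lines. A single-process simulation of AllReduce over per-worker gradients — illustrative semantics only, not a real distributed implementation:

```python
def all_reduce_sum(worker_grads):
    """AllReduce with a sum op: every worker ends up holding the elementwise total.

    Real implementations (NCCL, MPI) use ring or tree schedules so per-link
    traffic stays near 2x the gradient size regardless of worker count; this
    single-process version only simulates the semantics.
    """
    total = [sum(vals) for vals in zip(*worker_grads)]
    return [list(total) for _ in worker_grads]   # each worker receives a copy

# Three workers, each with a local gradient over the same two parameters
grads = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
reduced = all_reduce_sum(grads)
# Every worker now holds roughly [0.9, 1.2]; dividing by the worker count
# gives the averaged gradient used in synchronous data parallelism.
```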
#### Part II: Distributed Systems
Training and inference across multiple machines.
| Ch | Title | Purpose |
|----|-------|---------|
| 5 | Distributed Training Systems | Data parallel, model parallel, pipeline parallel |
| 6 | Fault Tolerance and Resilience | Checkpointing, recovery, handling failures |
| 7 | Inference at Scale | Serving systems, batching, latency optimization |
| 8 | Edge Intelligence Systems | Deploying ML at the edge, constraints |
#### Part III: Production Challenges
Real-world complexities of operating ML systems.
| Ch | Title | Purpose |
|----|-------|---------|
| 9 | On-Device Learning | Training on edge devices, federated learning |
| 10 | Privacy-Preserving ML Systems | Differential privacy, secure computation |
| 11 | Robust and Reliable AI | Adversarial robustness, distribution shift |
| 12 | ML Operations at Scale | Advanced MLOps, platform engineering |
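For Ch 10 (Privacy-Preserving ML Systems), the textbook starting point is the Laplace mechanism for differential privacy. A minimal scalar-query sketch, with illustrative names and parameter values chosen for the example:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) via the inverse CDF of a uniform sample."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_release(true_value, sensitivity, epsilon, rng):
    """Laplace mechanism: an epsilon-DP release of one numeric query result."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
# Release a count of 42 where any one person changes the count by at most 1
noisy = private_release(42, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller epsilon means stronger privacy and more noise — the accuracy/privacy trade-off that chapter would then extend to gradients and secure aggregation.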
#### Part IV: Responsible Deployment
Building ML systems that benefit society.
| Ch | Title | Purpose |
|----|-------|---------|
| 13 | Responsible AI Systems | Fairness, accountability, transparency |
| 14 | Sustainable AI | Environmental impact, efficient computing |
| 15 | AI for Good | Applications for societal benefit |
| 16 | Frontiers and Future Directions | Emerging trends, open problems |
#### Volume II Conclusion
Synthesis, future of the field, call to action.
---
## Key Design Decisions
### Why This Split?
1. **Pedagogical Progression**: Vol I covers what every ML practitioner needs. Vol II covers what scale/production engineers need.
2. **Course Adoptability**: Vol I maps to a single semester intro course. Vol II maps to an advanced graduate seminar.
3. **Standalone Completeness**: A reader of only Vol I still gets responsible systems awareness through Chapter 14.
4. **Industry Alignment**: Vol I produces capable junior engineers. Vol II produces senior/staff-level systems thinkers.
### Chapter 14 in Volume I: Responsible Systems Preview
This chapter is intentionally brief (a preview, not a deep dive), covering:
- Robustness basics (models can fail)
- Security basics (models can be attacked)
- Fairness basics (models can discriminate)
- Sustainability basics (training has environmental cost)
Each topic points to the relevant Volume II chapter for deep treatment. This ensures Vol I readers are aware of these concerns without duplicating Vol II content.
### What Moves Between Volumes?
From original single-book structure:
- On-Device Learning → Vol II (requires scale context)
- Privacy/Security → Vol II (production concern)
- Robust AI → Vol II (advanced topic)
- Responsible AI → Vol II (deep treatment)
- Sustainable AI → Vol II (deep treatment)
- AI for Good → Vol II (capstone application)
- Frontiers → Vol II (forward-looking capstone)
---
## Questions for Reviewers
1. Does Volume I stand alone as a complete, responsible introduction to ML systems?
2. Is the progression within each volume logical for students?
3. Would you adopt Volume I for an introductory ML systems course?
4. Is Chapter 14 (Responsible Systems Preview) sufficient, or should Vol I include more depth on any topic?
5. Are any chapters misplaced between volumes?
6. Are 14 chapters (Vol I) and 16 chapters (Vol II) the right sizes?
7. What's missing from either volume?
---
## Revision History
- **v0.1** (2024-12-31): Initial draft for review
- **v0.2** (2024-12-31): Updated Part names based on reviewer feedback
- Volume I: Single-word Part names (Foundations, Development, Optimization, Operations)
- Volume II: Two-word Part names (unchanged, already clear)
- **v1.0** (2024-12-31): Structure approved after Round 2 review
- All reviewers (Patterson, Stoica, Reddi, Dean) approve chapter structure
- Part naming convention approved
- Ready for website implementation