mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
chore: remove volume-splitting strategy docs from public repo
These files contain internal planning documents that should not be public:
- VOLUME_SPLIT_ROADMAP.md (project timeline and MIT Press plans)
- VOLUME_SPLIT_SURGICAL_PLAN.md (section-by-section surgery)
- VOLUME_STRUCTURE_PROPOSAL.md (publisher proposal draft)
@@ -1,394 +0,0 @@
# Machine Learning Systems: Two-Volume Split - Master Roadmap

**Project Start Date**: December 2024
**Target Completion**: June 2025 (6 months)
**Goal**: Create flagship two-volume ML Systems textbook series for MIT Press

---

## 📋 Project Documents

### Core Planning Documents
1. **`MIT_PRESS_PROPOSAL.md`** - Official proposal for MIT Press (ready to submit)
2. **`VOLUME_SPLIT_SURGICAL_PLAN.md`** - Section-by-section surgery instructions (50+ pages)
3. **`VOLUME_SPLIT_ROADMAP.md`** - This document: master project plan and progress tracker
4. **`VOLUME_STRUCTURE_PROPOSAL.md`** - Original analysis and proposal

### Supporting Documents
- `VOLUME_SPLIT_ANALYSIS.md` - Deep analysis of pedagogical issues
- `VOLUME_SPLIT_EXECUTIVE_SUMMARY.md` - Quick reference summary
- `DISTRIBUTED_CONTENT_ADDITIONS.md` - Distributed awareness additions for V1

---

## 🎯 Project Vision

**Volume I: Introduction to ML Systems** (~1,150-1,200 pages)
- Complete single-system ML engineering
- Includes distributed awareness (not implementation)
- Target: Undergrads, bootcamps, ML engineers entering the field

**Volume II: Advanced ML Systems** (~1,100-1,150 pages)
- Distributed systems, production scale, responsibility
- Built on timeless principles
- Target: Graduate students, ML infrastructure engineers, senior practitioners

**The One-Liner**:
> "Volume I teaches you to build ML systems that work; Volume II teaches you to build ML systems that scale."

---

## 📊 Current State (December 2024)

### Existing Content
- **Total pages**: 2,172 pages across 22 chapters
- **Status**: Complete draft of comprehensive single-volume book
- **Quality**: Refined through extensive review feedback

### Content Distribution
- **Chapters 1-14**: Form basis of Volume 1 (with surgery)
- **Chapters 15-21**: Move to Volume 2
- **New content needed**: 325-375 pages (8 new chapters for V2)

---

## 🗓️ Six-Month Timeline

### Phase 1: Chapter Surgery (Months 1-2)
**Goal**: Extract distributed content from 7 chapters, create clean V1

#### Month 1: Content Extraction
- **Week 1-2**: Extract distributed content from Chapters 6, 7, 8
  - [ ] Chapter 6 (Data Engineering): Extract 40 pages of distributed content
  - [ ] Chapter 7 (Frameworks): Extract 50 pages of distributed execution
  - [ ] Chapter 8 (Training): Extract 60 pages of distributed training

- **Week 3-4**: Extract from Chapters 10, 11, 12, 13
  - [ ] Chapter 10 (Optimizations): Extract 80 pages (NAS, AutoML)
  - [ ] Chapter 11 (Hardware): Extract 50 pages (multi-chip)
  - [ ] Chapter 12 (Benchmarking): Extract 40 pages (distributed benchmarking)
  - [ ] Chapter 13 (MLOps): Extract 30 pages (production scale)

#### Month 2: Bridging and Polish
- **Week 5-6**: Create V1 transitions
  - [ ] Add "See Volume 2" callout boxes in V1
  - [ ] Write brief distributed awareness sections
  - [ ] Ensure V1 chapters remain coherent

- **Week 7-8**: Organize extracted content
  - [ ] Create V2 chapter structure
  - [ ] Place extracted content in appropriate V2 chapters
  - [ ] Identify gaps in extracted content

### Phase 2: New Content Development (Months 3-5)

#### Month 3: Priority 1 Chapters (Essential Infrastructure)
- **Week 9-10**: Memory & Storage
  - [ ] V2 Ch2: Memory Hierarchies for ML (45 pages)
  - [ ] V2 Ch3: Storage Systems for ML (40 pages)

- **Week 11-12**: Communication & Distributed Training
  - [ ] V2 Ch4: Communication & Collective Operations (45 pages)
  - [ ] V2 Ch5: Distributed Training Systems (50 pages) - integrate extracted content

#### Month 4: Priority 2 Chapters (Production Requirements)
- **Week 13-14**: Fault Tolerance & Inference
  - [ ] V2 Ch6: Fault Tolerance & Resilience (40 pages)
  - [ ] V2 Ch7: Inference at Scale (45 pages)

- **Week 15-16**: Integration
  - [ ] Integrate all extracted content into new chapters
  - [ ] Write chapter introductions and conclusions
  - [ ] Create cross-references

#### Month 5: Priority 3 Chapters (Specialized Topics)
- **Week 17-18**: Edge Systems
  - [ ] V2 Ch8: Edge Intelligence Systems (50 pages)
  - [ ] Integrate extracted edge content from Ch2

- **Week 19-20**: Final new chapters
  - [ ] V2 Ch1: Bridge Chapter (30 pages) - From Single to Distributed
  - [ ] Update existing V2 chapters (15-20) with introductions

### Phase 3: Integration and Polish (Month 6)

#### Month 6: Final Integration
- **Week 21-22**: Cross-References and Consistency
  - [ ] Update all V1→V2 cross-references
  - [ ] Update all V2→V1 prerequisite references
  - [ ] Ensure consistent notation across volumes
  - [ ] Verify all figure references work

- **Week 23**: Narrative Flow
  - [ ] Review V1 narrative arc (Foundations → Building → Optimizing → Impact)
  - [ ] Review V2 narrative arc (Scale → Production → Responsibility)
  - [ ] Polish chapter transitions
  - [ ] Write volume prefaces

- **Week 24**: Final Quality Checks
  - [ ] Technical accuracy review
  - [ ] Page count verification
  - [ ] Exercise and quiz consistency
  - [ ] Final copyedit pass
  - [ ] Prepare camera-ready manuscripts

---

## 📈 Progress Tracking

### Volume 1 Progress (Target: 1,150-1,200 pages)

| Chapter | Current | Target | Surgery Status | Notes |
|---------|---------|--------|----------------|-------|
| 1. Introduction | 90 | 60 | ⬜ Not Started | Compress history section |
| 2. ML Systems | 70 | 70 | ⬜ Not Started | Extract hybrid architectures |
| 3. DL Primer | 110 | 100 | ⬜ Not Started | No surgery (keep as-is) |
| 4. DNN Architectures | 82 | 100 | ⬜ Not Started | No surgery (keep as-is) |
| 5. Workflow | 51 | 40 | ⬜ Not Started | Minor compression |
| 6. Data Engineering | 138 | 80 | ⬜ Not Started | Extract distributed storage |
| 7. Frameworks | 121 | 100 | ⬜ Not Started | Extract distributed execution |
| 8. Training | 157 | 100 | ⬜ Not Started | Extract distributed training |
| 9. Efficient AI | 52 | 60 | ⬜ Not Started | No surgery (keep as-is) |
| 10. Optimizations | 160 | 120 | ⬜ Not Started | Extract NAS, AutoML |
| 11. Hardware | 181 | 90 | ⬜ Not Started | Extract multi-chip |
| 12. Benchmarking | 124 | 80 | ⬜ Not Started | Extract distributed benchmarking |
| 13. MLOps | 126 | 50 | ⬜ Not Started | Extract production scale |
| 14. AI for Good | 84 | 50 | ⬜ Not Started | Minor compression |
| **TOTAL** | **1,546** | **1,100** | **0%** | **V1 baseline complete** |

Progress Key: ⬜ Not Started | 🟨 In Progress | ✅ Complete

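The totals in the Volume 1 table can be re-derived from the per-chapter figures. The following is a minimal sketch, assuming the page counts are copied by hand from the table; the names `V1_PAGES`, `totals`, and `gap_to_range` are illustrative and not part of any existing project tooling.

```python
# Per-chapter (current, target) page counts copied from the Volume 1 table.
V1_PAGES = {
    "Introduction": (90, 60),
    "ML Systems": (70, 70),
    "DL Primer": (110, 100),
    "DNN Architectures": (82, 100),
    "Workflow": (51, 40),
    "Data Engineering": (138, 80),
    "Frameworks": (121, 100),
    "Training": (157, 100),
    "Efficient AI": (52, 60),
    "Optimizations": (160, 120),
    "Hardware": (181, 90),
    "Benchmarking": (124, 80),
    "MLOps": (126, 50),
    "AI for Good": (84, 50),
}

# Stated Volume 1 target range from the section heading.
V1_TARGET_RANGE = (1150, 1200)


def totals(pages):
    """Return (current_total, target_total) summed over all chapters."""
    current = sum(cur for cur, _ in pages.values())
    target = sum(tgt for _, tgt in pages.values())
    return current, target


def gap_to_range(total, low, high):
    """Pages to add (positive) or cut (negative) to land inside [low, high]."""
    if total < low:
        return low - total
    if total > high:
        return high - total
    return 0


current_total, target_total = totals(V1_PAGES)
print(current_total, target_total)  # 1546 1100
print(gap_to_range(target_total, *V1_TARGET_RANGE))  # 50
```

Run against the table as written, the per-chapter targets sum to 1,100 pages, which is 50 pages short of the stated 1,150-1,200 range, so either the chapter targets or the range would need adjusting.
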
### Volume 2 Progress (Target: 1,100-1,150 pages)

| Chapter | Source | Target | Status | Notes |
|---------|--------|--------|--------|-------|
| 1. Bridge: Single to Distributed | NEW | 30 | ⬜ Not Started | Write from scratch |
| 2. Memory Hierarchies | NEW + Ch11 | 45 | ⬜ Not Started | New content + extracts |
| 3. Storage Systems | NEW + Ch6 | 40 | ⬜ Not Started | New content + extracts |
| 4. Communication & Collectives | NEW + Ch8 | 45 | ⬜ Not Started | New content + extracts |
| 5. Distributed Training | Ch8 + Ch10 | 50 | ⬜ Not Started | Consolidate extracts |
| 6. Fault Tolerance | NEW + Ch13 | 40 | ⬜ Not Started | New content + extracts |
| 7. Inference at Scale | NEW + Ch2 + Ch13 | 45 | ⬜ Not Started | New content + extracts |
| 8. Edge Intelligence | NEW + Ch2 | 50 | ⬜ Not Started | New content + extracts |
| 9. On-Device Learning | Ch14 (existing) | 127 | ⬜ Not Started | Move from V1 |
| 10. Privacy Systems | Ch15 (split) | 65 | ⬜ Not Started | Split Privacy/Security |
| 11. Security Systems | Ch15 (split) | 68 | ⬜ Not Started | Split Privacy/Security |
| 12. Robust AI | Ch16 (existing) | 137 | ⬜ Not Started | Move from V1 |
| 13. Responsible AI | Ch17 (existing) | 135 | ⬜ Not Started | Move from V1 |
| 14. Sustainable AI | Ch18 (existing) | 46 | ⬜ Not Started | Move from V1 |
| 15. Frontiers & AGI | Ch19+20 (merge) | 78 | ⬜ Not Started | Merge two chapters |
| **TOTAL** | **Mixed** | **1,001** | **0%** | **Need ~100 more pages** |

### New Content Writing Progress (325-375 pages needed)

| Chapter | Pages | Draft | Review | Final | Notes |
|---------|-------|-------|--------|-------|-------|
| Bridge Chapter | 30 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Memory Hierarchies | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Storage Systems | 40 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Communication | 45 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Distributed Training | 50 | ⬜ | ⬜ | ⬜ | Priority 1 (+ extracts) |
| Fault Tolerance | 40 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Inference at Scale | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Edge Intelligence | 50 | ⬜ | ⬜ | ⬜ | Priority 3 |
| **TOTAL NEW** | **345** | **0%** | **0%** | **0%** | 8 new chapters |

---

## 📝 Weekly Progress Log

### Week 1 (Dec 2-8, 2024)
- [ ] Created comprehensive surgical plan
- [ ] Created MIT Press proposal
- [ ] Created master roadmap
- [ ] **Next**: Begin Chapter 6 surgery

### Week 2 (Dec 9-15, 2024)
- [ ] Progress notes...

### Week 3 (Dec 16-22, 2024)
- [ ] Progress notes...

*(Continue weekly log throughout 6-month period)*

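The dates for later log entries follow from the first one (Week 1 starting Monday, December 2, 2024). A minimal sketch, assuming every project week runs Monday through Sunday with no breaks; `PROJECT_START` and `week_range` are hypothetical names, not existing project tooling.

```python
from datetime import date, timedelta

# Week 1 starts on this date, per the first log entry.
PROJECT_START = date(2024, 12, 2)


def week_range(week, start=PROJECT_START):
    """Return (first_day, last_day) for a 1-indexed project week."""
    if week < 1:
        raise ValueError("week numbers start at 1")
    first = start + timedelta(weeks=week - 1)
    return first, first + timedelta(days=6)


# Print the heading dates for the logged weeks and for the final week.
for week in (1, 2, 3, 24):
    first, last = week_range(week)
    print(f"Week {week}: {first.isoformat()} to {last.isoformat()}")
```

Under these assumptions Week 24 ends on May 18, 2025, ahead of the June 2025 target completion.
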
---

## 🎯 Key Milestones

- [ ] **End Month 1**: All distributed content extracted from V1 chapters
- [ ] **End Month 2**: V1 chapters coherent and complete
- [ ] **End Month 3**: Priority 1 new chapters drafted (160 pages)
- [ ] **End Month 4**: Priority 2 new chapters drafted (85 pages)
- [ ] **End Month 5**: All new content drafted (345 pages)
- [ ] **End Month 6**: Camera-ready manuscripts for both volumes

---

## ⚠️ Risks and Mitigation

### Risk 1: Page Count Imbalance
**Risk**: V1 or V2 ends up significantly larger/smaller than target
**Mitigation**:
- Monitor page counts weekly
- Adjust compression/expansion as needed
- Have flexibility targets (±100 pages)

### Risk 2: Missing Dependencies
**Risk**: V2 assumes V1 knowledge not actually covered
**Mitigation**:
- Create prerequisite matrix
- Add recap sections to V2 chapters
- Review cross-references monthly

### Risk 3: Timeline Slippage
**Risk**: New chapter writing takes longer than estimated
**Mitigation**:
- Prioritize essential chapters first
- Have backup plan to defer Priority 3 chapters
- Build 2-week buffer into timeline

### Risk 4: Content Duplication
**Risk**: Same concept explained in both volumes
**Mitigation**:
- Clear "basic vs. advanced" delineation
- V2 references V1 explicitly
- Review for overlap in Month 6

---

## 📚 Reference Materials

### Pedagogical Framework
- **V1 Narrative**: Foundations → Building → Optimizing → Impact
- **V2 Narrative**: Scale → Production → Responsibility
- **Connection**: V1 ends with inspiration, V2 begins with bridge

### Chapter Surgery Guidelines
- **Single-machine boundary**: Keep in V1
- **Distributed systems**: Move to V2
- **Production scale**: Move to V2
- **Advanced optimization**: Move to V2

### Writing Standards
- Timeless principles over current tech
- Every chapter has: Purpose, Learning Outcomes, Summary
- Concrete examples throughout
- "Fallacies and Pitfalls" section

---

## 🔧 Tools and Workflow

### Version Control
- [ ] Create `volume-split` branch in Git
- [ ] Track all changes in branch
- [ ] Regular commits with clear messages

### Organization
- `book/volume1/` - Volume 1 chapters
- `book/volume2/` - Volume 2 chapters
- `book/docs/` - All planning documents
- `book/extracted/` - Content extracted from V1

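The directory layout listed under Organization could be scaffolded with a short script. This is a sketch only, assuming the four directories are created relative to a root path supplied by the caller; `LAYOUT` and `scaffold` are illustrative names, and the demo runs in a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path

# Directory purposes as listed in the Organization section.
LAYOUT = {
    "book/volume1": "Volume 1 chapters",
    "book/volume2": "Volume 2 chapters",
    "book/docs": "All planning documents",
    "book/extracted": "Content extracted from V1",
}


def scaffold(root):
    """Create each directory in LAYOUT under root and return the created paths."""
    created = []
    for rel in LAYOUT:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Demonstrate in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for path in scaffold(tmp):
        rel = path.relative_to(tmp).as_posix()
        print(f"{rel}/ - {LAYOUT[rel]}")
```

Using `exist_ok=True` keeps the script idempotent, so it can be re-run as the structure evolves without failing on directories that already exist.
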
### Quality Checks
- [ ] Weekly page count tracking
- [ ] Monthly cross-reference review
- [ ] Technical accuracy spot checks
- [ ] Pedagogical flow reviews

---

## 📞 Stakeholder Communication

### MIT Press Updates
- **Monthly**: Progress report with page counts
- **Major milestones**: Notify when phases complete
- **Issues**: Immediate communication of risks

### Community/Reviewers
- **End Month 2**: Share V1 draft for review
- **End Month 4**: Share V2 draft chapters for review
- **End Month 5**: Full review cycle

---

## ✅ Final Checklist (Month 6)

### Volume 1 Completion
- [ ] All 14 chapters present and coherent
- [ ] Page count: 1,150-1,250 pages
- [ ] All cross-references to V2 marked clearly
- [ ] Exercises and quizzes updated
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared

### Volume 2 Completion
- [ ] All 15 chapters present and coherent
- [ ] Page count: 1,100-1,200 pages
- [ ] Bridge chapter effective
- [ ] New chapters integrate extracted content
- [ ] Exercises and quizzes complete
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared

### Both Volumes
- [ ] Consistent notation across volumes
- [ ] No content duplication
- [ ] Clear prerequisite chain
- [ ] Professional copyedit complete
- [ ] Ready for MIT Press submission

---

## 📊 Success Metrics

### Quantitative
- Volume 1: 1,150-1,250 pages ✓/✗
- Volume 2: 1,100-1,200 pages ✓/✗
- New content: 325-375 pages ✓/✗
- Timeline: 6 months ✓/✗

### Qualitative
- Each volume independently valuable ✓/✗
- Clear pedagogical progression ✓/✗
- MIT Press approval ✓/✗
- Reviewer feedback positive ✓/✗

---

## 🎓 Post-Completion

### Publication Process
- [ ] Submit to MIT Press
- [ ] Incorporate editorial feedback
- [ ] Final production review
- [ ] Marketing materials
- [ ] Course adoption outreach

### Maintenance
- [ ] Errata tracking system
- [ ] Annual review cycle
- [ ] Community feedback integration
- [ ] Future edition planning

---

## 📝 Notes and Decisions

### December 2024 - Project Launch
- Decision: Committed to 6-month timeline
- Decision: Will do full surgery, not quick split
- Decision: Flagship quality is priority over speed
- Next decision needed: [track decisions here]

---

**Last Updated**: December 7, 2024
**Status**: Planning Complete - Ready to Begin Execution
**Next Action**: Begin Chapter 6 (Data Engineering) surgery - Week 1

---

*This roadmap is the master coordination document for the two-volume split project. Update weekly with progress, decisions, and course corrections.*
File diff suppressed because it is too large
@@ -1,216 +0,0 @@
# Machine Learning Systems: Two-Volume Structure

**Proposal for MIT Press**
*Draft: December 2024*

---

## Executive Summary

The *Machine Learning Systems* textbook will be published as two complementary volumes of 14 chapters each:

| Volume | Title | Focus | Chapters |
|--------|-------|-------|----------|
| **Volume 1** | Introduction to Machine Learning Systems | Complete ML lifecycle, single-system focus | 14 (all existing) |
| **Volume 2** | Advanced Machine Learning Systems | Principles of scale, distribution, and production | 14 (6 existing, 8 new) |

**Guiding Philosophy:**
- **Volume 1**: Everything you need to build ML systems on a single machine, ending on a positive note with societal impact
- **Volume 2**: Timeless principles for operating ML systems at scale, grounded in physics and mathematics rather than current technologies

---

## Volume 1: Introduction to Machine Learning Systems

*The complete ML lifecycle: understand it, build it, optimize it, deploy it, use it for good.*

| Part | Chapter | Description |
|------|---------|-------------|
| **Part I: Systems Foundations** | | *What are ML systems?* |
| | 1. Introduction | Motivation and scope |
| | 2. ML Systems | System-level view of machine learning |
| | 3. Deep Learning Primer | Neural network fundamentals |
| | 4. DNN Architectures | Modern architecture patterns |
| **Part II: Design Principles** | | *How do you build ML systems?* |
| | 5. Workflow | End-to-end ML pipeline design |
| | 6. Data Engineering | Data collection, processing, validation |
| | 7. Frameworks | PyTorch, TensorFlow, JAX ecosystem |
| | 8. Training | Training loops, hyperparameters, convergence |
| **Part III: Performance Engineering** | | *How do you make ML systems fast?* |
| | 9. Efficient AI | Efficiency principles and metrics |
| | 10. Optimizations | Quantization, pruning, distillation |
| | 11. Hardware Acceleration | GPUs, TPUs, custom accelerators |
| | 12. Benchmarking | Measurement, MLPerf, evaluation methodology |
| **Part IV: Practice & Impact** | | *How do you deploy and use ML systems responsibly?* |
| | 13. ML Operations | Deployment, monitoring, CI/CD for ML |
| | 14. AI for Good | Positive societal applications |

**Total: 14 chapters across 4 parts (all existing content)**

*Early awareness:* include a short Sustainable AI note in Benchmarking or ML Operations to flag energy and carbon impacts without adding another chapter.

### Volume 1 Narrative Arc

The book progresses from understanding → building → optimizing → deploying → impact:

1. **Foundations** establish what ML systems are and why they matter
2. **Design** teaches how to construct complete pipelines
3. **Performance** shows how to make systems efficient
4. **Practice & Impact** completes the lifecycle and ends on an inspirational note

Ending on "AI for Good" leaves students with a positive vision of what they can build.

---

## Volume 2: Advanced Machine Learning Systems

*Timeless principles for building and operating ML systems at scale.*

| Part | Chapter | Status | Description |
|------|---------|--------|-------------|
| **Part I: Data Movement & Memory** | | | *Moving data is the bottleneck* |
| | 1. Memory Hierarchies for ML | 🆕 NEW | GPU memory, HBM, activation checkpointing |
| | 2. Storage Systems for ML | 🆕 NEW | Distributed storage, checkpointing, feature stores |
| | 3. Communication & Collective Operations | 🆕 NEW | AllReduce, gradient compression, network topology |
| **Part II: Parallelism & Coordination** | | | *Decomposing computation across machines* |
| | 4. Distributed Training | 🆕 NEW | Data/model/pipeline/tensor parallelism |
| | 5. Fault Tolerance & Recovery | 🆕 NEW | Checkpointing, elastic training, failure handling |
| | 6. Inference Systems | 🆕 NEW | Batching, serving architectures, autoscaling |
| **Part III: Constrained Environments** | | | *Doing more with less* |
| | 7. On-device Learning | Existing | Training and adaptation on edge devices |
| | 8. Edge Deployment | 🆕 NEW | Compilation, runtime optimization, real-time |
| **Part IV: Adversarial Environments** | | | *Systems under attack and uncertainty* |
| | 9. Privacy in ML Systems | Existing | Differential privacy, federated learning, secure aggregation |
| | 10. Security in ML Systems | 🆕 NEW | Supply chain, API security, multi-tenant isolation |
| | 11. Robust AI | Existing | Adversarial robustness, distribution shift, monitoring |
| **Part V: Stewardship** | | | *Building systems that serve humanity* |
| | 12. Responsible AI | Existing | Fairness, accountability, transparency at scale |
| | 13. Sustainable AI | Existing | Energy efficiency, carbon footprint, environmental impact |
| | 14. Frontiers & Future Directions | Existing | Emerging paradigms, open problems, conclusion |

**Total: 14 chapters across 5 parts (6 existing, 8 new)**

---

## New Content for Volume 2

### Part I: Data Movement & Memory
*The physics of data movement is the fundamental constraint in modern ML.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Memory Hierarchies for ML** | GPU memory management, HBM architecture, caching strategies, activation checkpointing, memory-efficient attention | Memory bandwidth limits compute utilization |
| **Storage Systems for ML** | Distributed file systems, checkpoint I/O, feature stores, data lakes, prefetching, I/O scheduling | Storage throughput gates training speed |
| **Communication & Collective Operations** | AllReduce algorithms, ring/tree topologies, gradient compression, RDMA fundamentals, network topology design | Communication overhead limits scaling |

### Part II: Parallelism & Coordination
*The mathematics of decomposing work across machines.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Distributed Training** | Data parallelism, model parallelism (tensor, pipeline, expert), hybrid strategies, synchronization, load balancing | Parallelism has fundamental trade-offs |
| **Fault Tolerance & Recovery** | Checkpoint strategies, async checkpointing, elastic training, failure detection, graceful degradation | Large systems fail; recovery must be designed in |
| **Inference Systems** | Batching strategies, continuous batching, KV cache management, model serving patterns, autoscaling, SLO management | Serving has different constraints than training |

### Part III: Constrained Environments
*Operating under resource limitations.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Edge Deployment** | Model compilation, runtime optimization, heterogeneous hardware, real-time constraints, power management | Constraints force creativity |

### Part IV: Adversarial Environments
*Systems facing attacks, privacy requirements, and uncertainty.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Security in ML Systems** | Model provenance, supply chain security, API protection, multi-tenant isolation, access control | Production systems face adversaries |

---

## Design Principles

### Why This Structure Works

**Volume 1 (Single System)**
- Teaches the complete lifecycle
- Everything can be learned and practiced on one machine
- Ends positively with societal impact

**Volume 2 (Distributed Systems)**
- Builds on Volume 1 foundations
- Addresses what changes at scale
- Organized around timeless constraints, not current technologies

### What Makes Volume 2 Timeless

Each part addresses constraints rooted in physics, mathematics, or human nature:

| Part | Eternal Constraint | Foundation |
|------|-------------------|------------|
| Data Movement & Memory | Moving data costs more than compute | Physics: speed of light, memory bandwidth |
| Parallelism & Coordination | Work must be decomposed and synchronized | Mathematics of parallel computation |
| Constrained Environments | Resources are always finite | Economics and physics |
| Adversarial Environments | Attackers and uncertainty exist | Human nature, statistics |
| Stewardship | Technology must serve humanity | Ethics, sustainability |

Chapters use current examples (LLMs, transformers, specific hardware) but frame them as instances of these enduring principles.

---

## Content Migration Summary

| Chapter | Volume 1 | Volume 2 | Rationale |
|---------|----------|----------|-----------|
| Introduction through Benchmarking | ✓ | | Core technical content |
| ML Operations | ✓ | | Completes the lifecycle |
| AI for Good | ✓ | | Positive conclusion |
| On-device Learning | | ✓ | Edge/constrained is advanced |
| Privacy & Security | | ✓ | Production security is advanced |
| Robust AI | | ✓ | Production robustness is advanced |
| Responsible AI | | ✓ | Scale changes the challenges |
| Sustainable AI | | ✓ | Datacenter scale is advanced |
| Frontiers | | ✓ | Conclusion for advanced volume |

---

## Audience

| Volume | Primary Audience | Use Cases |
|--------|-----------------|-----------|
| Volume 1 | All ML practitioners, undergraduates, bootcamp students | First course in ML systems, self-study |
| Volume 2 | Infrastructure engineers, graduate students, researchers | Advanced course, reference for practitioners at scale |

---

## Collaboration Model

Volume 2's new chapters are candidates for collaborative authorship:

| Topic Area | Ideal Collaborator Profile |
|------------|---------------------------|
| Memory & Storage | Datacenter architects, MLPerf Storage contributors |
| Networking & Communication | Distributed systems researchers, framework developers |
| Distributed Training | PyTorch/JAX distributed teams, hyperscaler engineers |
| Fault Tolerance | Site reliability engineers, systems researchers |
| Inference Systems | ML serving infrastructure engineers |
| Edge Deployment | Embedded ML practitioners, compiler engineers |
| Security | ML security researchers, production security engineers |

---

## Summary Statistics

| Metric | Volume 1 | Volume 2 |
|--------|----------|----------|
| Chapters | 14 | 14 |
| Parts | 4 | 5 |
| Existing content | 14 | 6 |
| New content | 0 | 8 |
| Focus | Single system | Distributed systems |
| Prerequisite | None | Volume 1 |

---

*Document Version: December 2024*
*For discussion with MIT Press and potential collaborators*