mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-12 02:06:14 -05:00
docs: remove stale volume planning documents
Remove outdated planning documents that have been superseded by VOLUME_STRUCTURE.md, which now serves as the authoritative reference for the two-volume textbook organization.

Removed files:
- VOLUME_SPLIT_ROADMAP.md
- VOLUME_SPLIT_SURGICAL_PLAN.md
- VOLUME_STRUCTURE_PROPOSAL.md
- reviewer-feedback-synthesis-r1.md
- volume-outline-draft.md
@@ -1,394 +0,0 @@
# Machine Learning Systems: Two-Volume Split - Master Roadmap

**Project Start Date**: December 2024
**Target Completion**: June 2025 (6 months)
**Goal**: Create a flagship two-volume ML Systems textbook series for MIT Press

---

## 📋 Project Documents

### Core Planning Documents
1. **`MIT_PRESS_PROPOSAL.md`** - Official proposal for MIT Press (ready to submit)
2. **`VOLUME_SPLIT_SURGICAL_PLAN.md`** - Section-by-section surgery instructions (50+ pages)
3. **`VOLUME_SPLIT_ROADMAP.md`** - This document: master project plan and progress tracker
4. **`VOLUME_STRUCTURE_PROPOSAL.md`** - Original analysis and proposal

### Supporting Documents
- `VOLUME_SPLIT_ANALYSIS.md` - Deep analysis of pedagogical issues
- `VOLUME_SPLIT_EXECUTIVE_SUMMARY.md` - Quick reference summary
- `DISTRIBUTED_CONTENT_ADDITIONS.md` - Distributed-awareness additions for V1

---

## 🎯 Project Vision

**Volume I: Introduction to ML Systems** (~1,150-1,200 pages)
- Complete single-system ML engineering
- Includes distributed awareness (not implementation)
- Target: undergraduates, bootcamps, ML engineers entering the field

**Volume II: Advanced ML Systems** (~1,100-1,150 pages)
- Distributed systems, production scale, responsibility
- Built on timeless principles
- Target: graduate students, ML infrastructure engineers, senior practitioners

**The One-Liner**:
> "Volume I teaches you to build ML systems that work; Volume II teaches you to build ML systems that scale."

---
## 📊 Current State (December 2024)

### Existing Content
- **Total pages**: 2,172 pages across 22 chapters
- **Status**: Complete draft of a comprehensive single-volume book
- **Quality**: Refined through extensive review feedback

### Content Distribution
- **Chapters 1-14**: Form the basis of Volume 1 (with surgery)
- **Chapters 15-21**: Move to Volume 2
- **New content needed**: 325-375 pages (8 new chapters for V2)

---

## 🗓️ Six-Month Timeline

### Phase 1: Chapter Surgery (Months 1-2)
**Goal**: Extract distributed content from 7 chapters, leaving a clean V1

#### Month 1: Content Extraction
- **Weeks 1-2**: Extract distributed content from Chapters 6, 7, 8
  - [ ] Chapter 6 (Data Engineering): extract 40 pages of distributed content
  - [ ] Chapter 7 (Frameworks): extract 50 pages of distributed execution
  - [ ] Chapter 8 (Training): extract 60 pages of distributed training

- **Weeks 3-4**: Extract from Chapters 10, 11, 12, 13
  - [ ] Chapter 10 (Optimizations): extract 80 pages (NAS, AutoML)
  - [ ] Chapter 11 (Hardware): extract 50 pages (multi-chip)
  - [ ] Chapter 12 (Benchmarking): extract 40 pages (distributed benchmarking)
  - [ ] Chapter 13 (MLOps): extract 30 pages (production scale)

#### Month 2: Bridging and Polish
- **Weeks 5-6**: Create V1 transitions
  - [ ] Add "See Volume 2" callout boxes in V1
  - [ ] Write brief distributed-awareness sections
  - [ ] Ensure V1 chapters remain coherent

- **Weeks 7-8**: Organize extracted content
  - [ ] Create the V2 chapter structure
  - [ ] Place extracted content in the appropriate V2 chapters
  - [ ] Identify gaps in extracted content
### Phase 2: New Content Development (Months 3-5)

#### Month 3: Priority 1 Chapters (Essential Infrastructure)
- **Weeks 9-10**: Memory & Storage
  - [ ] V2 Ch2: Memory Hierarchies for ML (45 pages)
  - [ ] V2 Ch3: Storage Systems for ML (40 pages)

- **Weeks 11-12**: Communication & Distributed Training
  - [ ] V2 Ch4: Communication & Collective Operations (45 pages)
  - [ ] V2 Ch5: Distributed Training Systems (50 pages) - integrate extracted content

#### Month 4: Priority 2 Chapters (Production Requirements)
- **Weeks 13-14**: Fault Tolerance & Inference
  - [ ] V2 Ch6: Fault Tolerance & Resilience (40 pages)
  - [ ] V2 Ch7: Inference at Scale (45 pages)

- **Weeks 15-16**: Integration
  - [ ] Integrate all extracted content into the new chapters
  - [ ] Write chapter introductions and conclusions
  - [ ] Create cross-references

#### Month 5: Priority 3 Chapters (Specialized Topics)
- **Weeks 17-18**: Edge Systems
  - [ ] V2 Ch8: Edge Intelligence Systems (50 pages)
  - [ ] Integrate extracted edge content from Ch2

- **Weeks 19-20**: Final new chapters
  - [ ] V2 Ch1: Bridge Chapter (30 pages) - From Single to Distributed
  - [ ] Update the existing V2 chapters (moved from Chapters 15-20) with introductions

### Phase 3: Integration and Polish (Month 6)

#### Month 6: Final Integration
- **Weeks 21-22**: Cross-References and Consistency
  - [ ] Update all V1→V2 cross-references
  - [ ] Update all V2→V1 prerequisite references
  - [ ] Ensure consistent notation across volumes
  - [ ] Verify all figure references work

- **Week 23**: Narrative Flow
  - [ ] Review the V1 narrative arc (Foundations → Building → Optimizing → Impact)
  - [ ] Review the V2 narrative arc (Scale → Production → Responsibility)
  - [ ] Polish chapter transitions
  - [ ] Write volume prefaces

- **Week 24**: Final Quality Checks
  - [ ] Technical accuracy review
  - [ ] Page count verification
  - [ ] Exercise and quiz consistency
  - [ ] Final copyedit pass
  - [ ] Prepare camera-ready manuscripts

---
## 📈 Progress Tracking

### Volume 1 Progress (Target: 1,150-1,200 pages)

| Chapter | Current | Target | Surgery Status | Notes |
|---------|---------|--------|----------------|-------|
| 1. Introduction | 90 | 60 | ⬜ Not Started | Compress history section |
| 2. ML Systems | 70 | 70 | ⬜ Not Started | Extract hybrid architectures |
| 3. DL Primer | 110 | 100 | ⬜ Not Started | No surgery (light compression only) |
| 4. DNN Architectures | 82 | 100 | ⬜ Not Started | No surgery (room to expand) |
| 5. Workflow | 51 | 40 | ⬜ Not Started | Minor compression |
| 6. Data Engineering | 138 | 80 | ⬜ Not Started | Extract distributed storage |
| 7. Frameworks | 121 | 100 | ⬜ Not Started | Extract distributed execution |
| 8. Training | 157 | 100 | ⬜ Not Started | Extract distributed training |
| 9. Efficient AI | 52 | 60 | ⬜ Not Started | No surgery (room to expand) |
| 10. Optimizations | 160 | 120 | ⬜ Not Started | Extract NAS, AutoML |
| 11. Hardware | 181 | 90 | ⬜ Not Started | Extract multi-chip |
| 12. Benchmarking | 124 | 80 | ⬜ Not Started | Extract distributed benchmarking |
| 13. MLOps | 126 | 50 | ⬜ Not Started | Extract production scale |
| 14. AI for Good | 84 | 50 | ⬜ Not Started | Minor compression |
| **TOTAL** | **1,546** | **1,100** | **0%** | **Baseline counts recorded** |

Progress Key: ⬜ Not Started | 🟨 In Progress | ✅ Complete
### Volume 2 Progress (Target: 1,100-1,150 pages)

| Chapter | Source | Target | Status | Notes |
|---------|--------|--------|--------|-------|
| 1. Bridge: Single to Distributed | NEW | 30 | ⬜ Not Started | Write from scratch |
| 2. Memory Hierarchies | NEW + Ch11 | 45 | ⬜ Not Started | New content + extracts |
| 3. Storage Systems | NEW + Ch6 | 40 | ⬜ Not Started | New content + extracts |
| 4. Communication & Collectives | NEW + Ch8 | 45 | ⬜ Not Started | New content + extracts |
| 5. Distributed Training | Ch8 + Ch10 | 50 | ⬜ Not Started | Consolidate extracts |
| 6. Fault Tolerance | NEW + Ch13 | 40 | ⬜ Not Started | New content + extracts |
| 7. Inference at Scale | NEW + Ch2 + Ch13 | 45 | ⬜ Not Started | New content + extracts |
| 8. Edge Intelligence | NEW + Ch2 | 50 | ⬜ Not Started | New content + extracts |
| 9. On-Device Learning | Ch14 (existing) | 127 | ⬜ Not Started | Move from V1 |
| 10. Privacy Systems | Ch15 (split) | 65 | ⬜ Not Started | Split Privacy/Security |
| 11. Security Systems | Ch15 (split) | 68 | ⬜ Not Started | Split Privacy/Security |
| 12. Robust AI | Ch16 (existing) | 137 | ⬜ Not Started | Move from V1 |
| 13. Responsible AI | Ch17 (existing) | 135 | ⬜ Not Started | Move from V1 |
| 14. Sustainable AI | Ch18 (existing) | 46 | ⬜ Not Started | Move from V1 |
| 15. Frontiers & AGI | Ch19+20 (merge) | 78 | ⬜ Not Started | Merge two chapters |
| **TOTAL** | **Mixed** | **1,001** | **0%** | **Need ~100 more pages** |

### New Content Writing Progress (325-375 pages needed)

| Chapter | Pages | Draft | Review | Final | Notes |
|---------|-------|-------|--------|-------|-------|
| Bridge Chapter | 30 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Memory Hierarchies | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Storage Systems | 40 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Communication | 45 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Distributed Training | 50 | ⬜ | ⬜ | ⬜ | Priority 1 (+ extracts) |
| Fault Tolerance | 40 | ⬜ | ⬜ | ⬜ | Priority 2 |
| Inference at Scale | 45 | ⬜ | ⬜ | ⬜ | Priority 1 |
| Edge Intelligence | 50 | ⬜ | ⬜ | ⬜ | Priority 3 |
| **TOTAL NEW** | **345** | **0%** | **0%** | **0%** | 8 new chapters |

---
## 📝 Weekly Progress Log

### Week 1 (Dec 2-8, 2024)
- [ ] Created comprehensive surgical plan
- [ ] Created MIT Press proposal
- [ ] Created master roadmap
- [ ] **Next**: Begin Chapter 6 surgery

### Week 2 (Dec 9-15, 2024)
- [ ] Progress notes...

### Week 3 (Dec 16-22, 2024)
- [ ] Progress notes...

*(Continue the weekly log throughout the 6-month period)*

---

## 🎯 Key Milestones

- [ ] **End Month 1**: All distributed content extracted from V1 chapters
- [ ] **End Month 2**: V1 chapters coherent and complete
- [ ] **End Month 3**: Month 3 new chapters drafted (180 pages)
- [ ] **End Month 4**: Priority 2 new chapters drafted (85 pages)
- [ ] **End Month 5**: All new content drafted (345 pages)
- [ ] **End Month 6**: Camera-ready manuscripts for both volumes
## ⚠️ Risks and Mitigation

### Risk 1: Page Count Imbalance
**Risk**: V1 or V2 ends up significantly larger or smaller than its target
**Mitigation**:
- Monitor page counts weekly
- Adjust compression/expansion as needed
- Keep targets flexible (±100 pages)

### Risk 2: Missing Dependencies
**Risk**: V2 assumes V1 knowledge that is not actually covered
**Mitigation**:
- Create a prerequisite matrix
- Add recap sections to V2 chapters
- Review cross-references monthly

### Risk 3: Timeline Slippage
**Risk**: New chapter writing takes longer than estimated
**Mitigation**:
- Prioritize essential chapters first
- Keep a backup plan to defer Priority 3 chapters
- Build a 2-week buffer into the timeline

### Risk 4: Content Duplication
**Risk**: The same concept is explained in both volumes
**Mitigation**:
- Maintain a clear "basic vs. advanced" delineation
- Have V2 reference V1 explicitly
- Review for overlap in Month 6

---
## 📚 Reference Materials

### Pedagogical Framework
- **V1 Narrative**: Foundations → Building → Optimizing → Impact
- **V2 Narrative**: Scale → Production → Responsibility
- **Connection**: V1 ends with inspiration; V2 begins with a bridge chapter

### Chapter Surgery Guidelines
- **Single-machine content**: Keep in V1
- **Distributed systems**: Move to V2
- **Production scale**: Move to V2
- **Advanced optimization**: Move to V2

### Writing Standards
- Timeless principles over current technology
- Every chapter has: Purpose, Learning Outcomes, Summary
- Concrete examples throughout
- A "Fallacies and Pitfalls" section in each chapter

---
## 🔧 Tools and Workflow

### Version Control
- [ ] Create a `volume-split` branch in Git
- [ ] Track all changes on that branch
- [ ] Make regular commits with clear messages
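The branch workflow above can be sketched as a short shell session. This is only an illustrative dry run in a throwaway repository (the placeholder identity and commit messages are ours, not project conventions); the directory layout mirrors the Organization section:

```shell
# Work in a scratch repo so this sketch is safe to run anywhere.
cd "$(mktemp -d)"
git init -q .
git config user.email "editor@example.com"   # placeholder identity
git config user.name "Editor"
git commit -q --allow-empty -m "baseline: single-volume manuscript"

# Create the dedicated branch that will hold all split work.
git checkout -q -b volume-split

# Scaffold the two-volume directory layout and commit it on the branch.
mkdir -p book/volume1 book/volume2 book/docs book/extracted
touch book/volume1/.gitkeep book/volume2/.gitkeep \
      book/docs/.gitkeep book/extracted/.gitkeep
git add book
git commit -q -m "chore: scaffold two-volume directory layout"

git branch --show-current   # confirms we are on volume-split
```

Keeping every change on one branch makes it easy to diff the split against the single-volume baseline at any point during the six months.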
### Organization
- `book/volume1/` - Volume 1 chapters
- `book/volume2/` - Volume 2 chapters
- `book/docs/` - All planning documents
- `book/extracted/` - Content extracted from V1

### Quality Checks
- [ ] Weekly page count tracking
- [ ] Monthly cross-reference review
- [ ] Technical accuracy spot checks
- [ ] Pedagogical flow reviews

---
## 📞 Stakeholder Communication

### MIT Press Updates
- **Monthly**: Progress report with page counts
- **Major milestones**: Notify when phases complete
- **Issues**: Immediate communication of risks

### Community/Reviewers
- **End Month 2**: Share V1 draft for review
- **End Month 4**: Share V2 draft chapters for review
- **End Month 5**: Full review cycle

---
## ✅ Final Checklist (Month 6)

### Volume 1 Completion
- [ ] All 14 chapters present and coherent
- [ ] Page count: 1,150-1,250 pages
- [ ] All cross-references to V2 marked clearly
- [ ] Exercises and quizzes updated
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared

### Volume 2 Completion
- [ ] All 15 chapters present and coherent
- [ ] Page count: 1,100-1,200 pages
- [ ] Bridge chapter effective
- [ ] New chapters integrate extracted content
- [ ] Exercises and quizzes complete
- [ ] Figures and tables numbered correctly
- [ ] Bibliography complete
- [ ] Index prepared

### Both Volumes
- [ ] Consistent notation across volumes
- [ ] No content duplication
- [ ] Clear prerequisite chain
- [ ] Professional copyedit complete
- [ ] Ready for MIT Press submission

---
## 📊 Success Metrics

### Quantitative
- Volume 1: 1,150-1,250 pages ✓/✗
- Volume 2: 1,100-1,200 pages ✓/✗
- New content: 325-375 pages ✓/✗
- Timeline: 6 months ✓/✗

### Qualitative
- Each volume independently valuable ✓/✗
- Clear pedagogical progression ✓/✗
- MIT Press approval ✓/✗
- Positive reviewer feedback ✓/✗

---
## 🎓 Post-Completion

### Publication Process
- [ ] Submit to MIT Press
- [ ] Incorporate editorial feedback
- [ ] Final production review
- [ ] Marketing materials
- [ ] Course adoption outreach

### Maintenance
- [ ] Errata tracking system
- [ ] Annual review cycle
- [ ] Community feedback integration
- [ ] Future edition planning

---
## 📝 Notes and Decisions

### December 2024 - Project Launch
- Decision: Committed to the 6-month timeline
- Decision: Full surgery, not a quick split
- Decision: Flagship quality takes priority over speed
- Next decision needed: [track decisions here]

---

**Last Updated**: December 7, 2024
**Status**: Planning Complete - Ready to Begin Execution
**Next Action**: Begin Chapter 6 (Data Engineering) surgery - Week 1

---

*This roadmap is the master coordination document for the two-volume split project. Update it weekly with progress, decisions, and course corrections.*
File diff suppressed because it is too large
@@ -1,216 +0,0 @@
# Machine Learning Systems: Two-Volume Structure

**Proposal for MIT Press**
*Draft: December 2024*

---

## Executive Summary

The *Machine Learning Systems* textbook will be published as two complementary volumes of 14 chapters each:

| Volume | Title | Focus | Chapters |
|--------|-------|-------|----------|
| **Volume 1** | Introduction to Machine Learning Systems | Complete ML lifecycle, single-system focus | 14 (all existing) |
| **Volume 2** | Advanced Machine Learning Systems | Principles of scale, distribution, and production | 14 (6 existing, 8 new) |

**Guiding Philosophy:**
- **Volume 1**: Everything you need to build ML systems on a single machine, ending on a positive note with societal impact
- **Volume 2**: Timeless principles for operating ML systems at scale, grounded in physics and mathematics rather than current technologies

---
## Volume 1: Introduction to Machine Learning Systems

*The complete ML lifecycle: understand it, build it, optimize it, deploy it, use it for good.*

| Part | Chapter | Description |
|------|---------|-------------|
| **Part I: Systems Foundations** | | *What are ML systems?* |
| | 1. Introduction | Motivation and scope |
| | 2. ML Systems | System-level view of machine learning |
| | 3. Deep Learning Primer | Neural network fundamentals |
| | 4. DNN Architectures | Modern architecture patterns |
| **Part II: Design Principles** | | *How do you build ML systems?* |
| | 5. Workflow | End-to-end ML pipeline design |
| | 6. Data Engineering | Data collection, processing, validation |
| | 7. Frameworks | PyTorch, TensorFlow, JAX ecosystem |
| | 8. Training | Training loops, hyperparameters, convergence |
| **Part III: Performance Engineering** | | *How do you make ML systems fast?* |
| | 9. Efficient AI | Efficiency principles and metrics |
| | 10. Optimizations | Quantization, pruning, distillation |
| | 11. Hardware Acceleration | GPUs, TPUs, custom accelerators |
| | 12. Benchmarking | Measurement, MLPerf, evaluation methodology |
| **Part IV: Practice & Impact** | | *How do you deploy and use ML systems responsibly?* |
| | 13. ML Operations | Deployment, monitoring, CI/CD for ML |
| | 14. AI for Good | Positive societal applications |

**Total: 14 chapters across 4 parts (all existing content)**

*Early awareness:* include a short Sustainable AI note in Benchmarking or ML Operations to flag energy and carbon impacts without adding another chapter.

### Volume 1 Narrative Arc

The book progresses from understanding → building → optimizing → deploying → impact:

1. **Foundations** establish what ML systems are and why they matter
2. **Design** teaches how to construct complete pipelines
3. **Performance** shows how to make systems efficient
4. **Practice & Impact** completes the lifecycle and ends on an inspirational note

Ending on "AI for Good" leaves students with a positive vision of what they can build.

---
## Volume 2: Advanced Machine Learning Systems

*Timeless principles for building and operating ML systems at scale.*

| Part | Chapter | Status | Description |
|------|---------|--------|-------------|
| **Part I: Data Movement & Memory** | | | *Moving data is the bottleneck* |
| | 1. Memory Hierarchies for ML | 🆕 NEW | GPU memory, HBM, activation checkpointing |
| | 2. Storage Systems for ML | 🆕 NEW | Distributed storage, checkpointing, feature stores |
| | 3. Communication & Collective Operations | 🆕 NEW | AllReduce, gradient compression, network topology |
| **Part II: Parallelism & Coordination** | | | *Decomposing computation across machines* |
| | 4. Distributed Training | 🆕 NEW | Data/model/pipeline/tensor parallelism |
| | 5. Fault Tolerance & Recovery | 🆕 NEW | Checkpointing, elastic training, failure handling |
| | 6. Inference Systems | 🆕 NEW | Batching, serving architectures, autoscaling |
| **Part III: Constrained Environments** | | | *Doing more with less* |
| | 7. On-device Learning | Existing | Training and adaptation on edge devices |
| | 8. Edge Deployment | 🆕 NEW | Compilation, runtime optimization, real-time |
| **Part IV: Adversarial Environments** | | | *Systems under attack and uncertainty* |
| | 9. Privacy in ML Systems | Existing | Differential privacy, federated learning, secure aggregation |
| | 10. Security in ML Systems | 🆕 NEW | Supply chain, API security, multi-tenant isolation |
| | 11. Robust AI | Existing | Adversarial robustness, distribution shift, monitoring |
| **Part V: Stewardship** | | | *Building systems that serve humanity* |
| | 12. Responsible AI | Existing | Fairness, accountability, transparency at scale |
| | 13. Sustainable AI | Existing | Energy efficiency, carbon footprint, environmental impact |
| | 14. Frontiers & Future Directions | Existing | Emerging paradigms, open problems, conclusion |

**Total: 14 chapters across 5 parts (6 existing, 8 new)**

---
## New Content for Volume 2

### Part I: Data Movement & Memory
*The physics of data movement is the fundamental constraint in modern ML.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Memory Hierarchies for ML** | GPU memory management, HBM architecture, caching strategies, activation checkpointing, memory-efficient attention | Memory bandwidth limits compute utilization |
| **Storage Systems for ML** | Distributed file systems, checkpoint I/O, feature stores, data lakes, prefetching, I/O scheduling | Storage throughput gates training speed |
| **Communication & Collective Operations** | AllReduce algorithms, ring/tree topologies, gradient compression, RDMA fundamentals, network topology design | Communication overhead limits scaling |
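To make the AllReduce row concrete: the ring variant can be simulated in a few lines of NumPy. This is a toy single-process sketch, not proposed chapter code; `ring_allreduce` and the per-node chunk layout are our own illustrative names.

```python
import numpy as np

def ring_allreduce(node_chunks):
    """Toy simulation of ring AllReduce.

    node_chunks[i][c] is node i's local copy of chunk c (one chunk per node).
    Returns buffers in which every node holds the elementwise sum of every chunk.
    """
    n = len(node_chunks)
    buf = [[np.array(c, dtype=float) for c in node] for node in node_chunks]
    # Reduce-scatter: in step s, node i sends chunk (i - s) % n to node i + 1.
    # After n - 1 steps, node i owns the complete sum of chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            buf[(i + 1) % n][c] += buf[i][c]
    # All-gather: each node forwards its completed chunk around the ring,
    # overwriting stale partial copies, until every node has every summed chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            buf[(i + 1) % n][c] = buf[i][c].copy()
    return buf
```

The point of the ring schedule is the "timeless principle" in the table: each node sends and receives 2(n-1) chunks of size 1/n of the vector, so per-node traffic stays nearly constant as the ring grows.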
### Part II: Parallelism & Coordination
*The mathematics of decomposing work across machines.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Distributed Training** | Data parallelism, model parallelism (tensor, pipeline, expert), hybrid strategies, synchronization, load balancing | Parallelism has fundamental trade-offs |
| **Fault Tolerance & Recovery** | Checkpoint strategies, async checkpointing, elastic training, failure detection, graceful degradation | Large systems fail; recovery must be designed in |
| **Inference Systems** | Batching strategies, continuous batching, KV cache management, model serving patterns, autoscaling, SLO management | Serving has different constraints than training |
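The fault-tolerance row can be illustrated with a minimal checkpoint-and-resume loop. This is a deliberately tiny sketch under our own assumptions (a single float stands in for model and optimizer state; all function names are hypothetical), but it shows the two design points real systems share: atomic checkpoint writes, and recovery cost bounded by the checkpoint interval.

```python
import json
import os

def save_checkpoint(path, step, state):
    # Write to a temp file, then atomically rename over the old checkpoint,
    # so a crash mid-write can never leave a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, 0.0  # fresh start: step 0, initial state
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, ckpt_every, fail_at=None):
    step, state = load_checkpoint(path)  # resume where the last run left off
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state += 1.0  # stand-in for one optimizer step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return state
```

A run that dies at step 7 with `ckpt_every=5` restarts from the step-5 checkpoint and repeats only two steps; the trade-off between checkpoint frequency and lost work is exactly what the chapter's "checkpoint strategies" topic covers.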
### Part III: Constrained Environments
*Operating under resource limitations.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Edge Deployment** | Model compilation, runtime optimization, heterogeneous hardware, real-time constraints, power management | Constraints force creativity |

### Part IV: Adversarial Environments
*Systems facing attacks, privacy requirements, and uncertainty.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Security in ML Systems** | Model provenance, supply chain security, API protection, multi-tenant isolation, access control | Production systems face adversaries |

---
## Design Principles

### Why This Structure Works

**Volume 1 (Single System)**
- Teaches the complete lifecycle
- Everything can be learned and practiced on one machine
- Ends positively with societal impact

**Volume 2 (Distributed Systems)**
- Builds on Volume 1 foundations
- Addresses what changes at scale
- Organized around timeless constraints, not current technologies

### What Makes Volume 2 Timeless

Each part addresses constraints rooted in physics, mathematics, or human nature:

| Part | Eternal Constraint | Foundation |
|------|-------------------|------------|
| Data Movement & Memory | Moving data costs more than compute | Physics: speed of light, memory bandwidth |
| Parallelism & Coordination | Work must be decomposed and synchronized | Mathematics of parallel computation |
| Constrained Environments | Resources are always finite | Economics and physics |
| Adversarial Environments | Attackers and uncertainty exist | Human nature, statistics |
| Stewardship | Technology must serve humanity | Ethics, sustainability |

Chapters use current examples (LLMs, transformers, specific hardware) but frame them as instances of these enduring principles.

---
## Content Migration Summary

| Chapter | Volume 1 | Volume 2 | Rationale |
|---------|----------|----------|-----------|
| Introduction through Benchmarking | ✓ | | Core technical content |
| ML Operations | ✓ | | Completes the lifecycle |
| AI for Good | ✓ | | Positive conclusion |
| On-device Learning | | ✓ | Edge/constrained is advanced |
| Privacy & Security | | ✓ | Production security is advanced |
| Robust AI | | ✓ | Production robustness is advanced |
| Responsible AI | | ✓ | Scale changes the challenges |
| Sustainable AI | | ✓ | Datacenter scale is advanced |
| Frontiers | | ✓ | Conclusion for the advanced volume |

---

## Audience

| Volume | Primary Audience | Use Cases |
|--------|-----------------|-----------|
| Volume 1 | All ML practitioners, undergraduates, bootcamp students | First course in ML systems, self-study |
| Volume 2 | Infrastructure engineers, graduate students, researchers | Advanced course, reference for practitioners at scale |

---
## Collaboration Model

Volume 2's new chapters are candidates for collaborative authorship:

| Topic Area | Ideal Collaborator Profile |
|------------|---------------------------|
| Memory & Storage | Datacenter architects, MLPerf Storage contributors |
| Networking & Communication | Distributed systems researchers, framework developers |
| Distributed Training | PyTorch/JAX distributed teams, hyperscaler engineers |
| Fault Tolerance | Site reliability engineers, systems researchers |
| Inference Systems | ML serving infrastructure engineers |
| Edge Deployment | Embedded ML practitioners, compiler engineers |
| Security | ML security researchers, production security engineers |

---

## Summary Statistics

| Metric | Volume 1 | Volume 2 |
|--------|----------|----------|
| Chapters | 14 | 14 |
| Parts | 4 | 5 |
| Existing chapters | 14 | 6 |
| New chapters | 0 | 8 |
| Focus | Single system | Distributed systems |
| Prerequisite | None | Volume 1 |

---

*Document Version: December 2024*
*For discussion with MIT Press and potential collaborators*
@@ -1,175 +0,0 @@
# Round 1 Reviewer Feedback Synthesis

## Reviewers
- **David Patterson** - Computer architecture, textbook author
- **Ion Stoica** - Distributed systems (Ray, Spark), Berkeley
- **Vijay Reddi** - TinyML, MLPerf, Harvard
- **Jeff Dean** - Google Senior Fellow, large-scale systems

---

## Consensus Issues (All/Most Reviewers Agree)

### 1. Volume I Is Incomplete Without Some Production/Scale Awareness

**Patterson**: "Cannot teach deployment without responsibility integration"
**Stoica**: "Data parallelism is now foundational, not advanced"
**Reddi**: "Edge deployment is fundamental, not advanced"
**Dean**: "Scale thinking should be woven throughout Vol I, not deferred"

**Consensus**: Vol I currently produces graduates who lack awareness of:
- Distributed training basics (Stoica, Dean)
- Resource constraints/edge deployment (Reddi)
- Responsible practices (Patterson)
- Cost and production realities (Dean)

### 2. The Chapter 14 "Preview" Approach Is Problematic

**Patterson**: "Pedagogically misguided - responsibility should be integrated throughout"
**All others**: Generally agree the preview is insufficient

**Consensus**: The "preview" approach treats important topics as afterthoughts.

### 3. Volume II Part I/II Ordering Needs Work

**Stoica**: "Teaching infrastructure before algorithms is pedagogically backwards"
**Dean**: Agrees infrastructure should come with context

**Consensus**: Teach distributed algorithms first, then the infrastructure that supports them.

### 4. Missing Hands-On/Practical Content

**Patterson**: "No mention of labs or programming assignments"
**Reddi**: "Students cannot deploy to microcontroller after Vol I"
**Dean**: "Missing debugging and profiling skills"

**Consensus**: Both volumes need explicit practical components.

---
## Key Disagreements/Different Emphases

### What Should Move to Volume I?

| Topic | Patterson | Stoica | Reddi | Dean |
|-------|-----------|--------|-------|------|
| Edge/TinyML deployment | Maybe | - | **CRITICAL** | - |
| Data parallelism basics | - | **CRITICAL** | - | Important |
| Checkpointing | - | **CRITICAL** | - | Important |
| Responsible AI integration | **CRITICAL** | - | - | - |
| Cost awareness | - | - | - | **CRITICAL** |

**Tension**: Each reviewer wants their specialty area elevated in Vol I, but we cannot add everything without making Vol I too large.

### Chapter Count

- **Patterson**: 14 chapters OK if balanced; concerned about page counts
- **Reddi**: Suggests 15 chapters (add edge deployment)
- **Stoica/Dean**: Focus less on count, more on content depth

---
## Specific Recommendations by Volume

### Volume I Additions (Ranked by Consensus)

| Priority | Addition | Supporters |
|----------|----------|------------|
| HIGH | Data parallelism basics in Ch 8 (Training) | Stoica, Dean |
| HIGH | Checkpointing basics in Ch 8 | Stoica, Dean |
| HIGH | Resource-constrained deployment chapter | Reddi (strong), Patterson (partial) |
| HIGH | Cost/efficiency awareness throughout | Dean |
| MEDIUM | Integrate responsibility throughout (not just Ch 14) | Patterson |
| MEDIUM | Expand quantization/pruning depth (Ch 10) | Reddi |
| MEDIUM | Strengthen benchmarking rigor (Ch 12) | Reddi |
|
||||
|
||||
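To make the "data parallelism basics" row concrete: the Ch 8 level of treatment could be as small as the synchronous gradient-averaging idea, sketched here framework-free. The loss, learning rate, and two-worker split below are illustrative assumptions, not proposed book content.

```python
# Toy synchronous data parallelism: each "worker" computes a gradient on
# its own data shard, gradients are averaged (the all-reduce step), and
# every replica applies the identical update.
def grad(w, shard):
    # d/dw of mean squared error for the model y = w * x on one shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    grads = [grad(w, s) for s in shards]   # computed in parallel in a real system
    avg = sum(grads) / len(grads)          # all-reduce: average across workers
    return w - lr * avg                    # same update on every replica

data = [(x, 2.0 * x) for x in range(1, 9)]  # true weight is 2.0
shards = [data[0::2], data[1::2]]           # two workers, disjoint shards
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

Every replica ends each step holding the same weight; the averaging line is the part real frameworks implement with collective operations such as AllReduce.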
### Volume II Restructuring

| Priority | Change | Supporters |
|----------|--------|------------|
| HIGH | Reorder Parts I/II (algorithms before infrastructure) | Stoica, Dean |
| HIGH | Add distributed systems theory basics | Stoica |
| MEDIUM | Add production debugging chapter | Dean, Stoica |
| MEDIUM | Expand MLOps chapter significantly | Dean |
| MEDIUM | Add cost/resource management | Dean |

---
## The Central Dilemma

**Cannot add everything without making the volumes too large.**

Options:

### Option A: Minimal Vol I Changes (Original + Small Additions)

- Keep current 14-chapter structure
- Add data parallelism section to Ch 8
- Add checkpointing section to Ch 8
- Strengthen Ch 14 preview (but keep as preview)
- Vol II restructures Part I/II ordering

**Pro**: Minimal disruption, faster to implement
**Con**: Patterson and Reddi concerns not fully addressed

### Option B: Add Edge Deployment to Vol I (Reddi's Recommendation)

- Add new Chapter 12: "Resource-Constrained Deployment"
- Renumber remaining chapters (15 total)
- Expand quantization/pruning depth
- Vol II restructures Part I/II

**Pro**: Addresses critical industry need (mobile/embedded)
**Con**: Makes Vol I larger, may be too ambitious

### Option C: Integrate Responsibility Throughout Vol I (Patterson's Recommendation)

- Distribute responsible systems content across chapters
- Remove standalone Ch 14 preview
- Add fairness to Ch 6 (Data), security to Ch 10, sustainability to Ch 9
- Keep 14 chapters but redistribute content

**Pro**: Pedagogically sounder integration
**Con**: Significant rewrite of multiple chapters

### Option D: Hybrid - Core Additions Only

- Add data parallelism + checkpointing to Ch 8 (Stoica/Dean consensus)
- Add brief edge deployment section to Ch 13 (MLOps), not a new chapter
- Keep Ch 14 but strengthen it with integrated callouts in earlier chapters
- Vol II restructures Part I/II

**Pro**: Addresses highest-consensus items without major restructuring
**Con**: Doesn't fully satisfy any single reviewer

---
## Recommendation for User Decision

**Suggested path forward**: Option D (Hybrid) for Vol I structure, with Vol II restructuring.

**Rationale**:

1. Stoica and Dean (industry leaders in distributed systems) agree on data parallelism/checkpointing - this is the highest-consensus item
2. A full edge deployment chapter (Reddi) is valuable but may be too ambitious for an immediate restructure
3. Full responsibility integration (Patterson) is pedagogically ideal but requires significant rewriting
4. Vol II restructuring (algorithms before infrastructure) has clear consensus

**What this means for your current draft**:

**Volume I Changes**:

- Ch 8 (Training): Add "Distributed Training Fundamentals" section
- Ch 8 (Training): Add "Checkpointing" section
- Ch 9 (Efficiency): Add brief energy/sustainability measurement
- Ch 14 (Preview): Strengthen, add forward references throughout earlier chapters
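As an indication of scope, the proposed Ch 8 "Checkpointing" section need only teach the save/resume pattern. A framework-agnostic sketch follows; the filename, dict layout, and toy "weights" update are illustrative assumptions, not the book's actual example.

```python
# Checkpoint/resume pattern: persist training state each epoch so an
# interrupted run restarts from the last completed epoch, not from scratch.
import json
import os

CKPT_PATH = "checkpoint.json"  # illustrative path

def save_checkpoint(state, path=CKPT_PATH):
    # Write to a temp file, then atomically rename, so a crash mid-write
    # cannot leave a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT_PATH):
    if not os.path.exists(path):
        return None  # fresh run: no checkpoint yet
    with open(path) as f:
        return json.load(f)

state = load_checkpoint() or {"epoch": -1, "weights": [0.0, 0.0]}
for epoch in range(state["epoch"] + 1, 5):
    # Stand-in for one real training epoch.
    state["weights"] = [w + 0.1 for w in state["weights"]]
    state["epoch"] = epoch
    save_checkpoint(state)  # a crash after any epoch loses at most one epoch
```

Run it, kill it partway, run it again: the loop resumes at the next epoch. In a real framework the state would carry model and optimizer state dicts rather than a raw list.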
**Volume II Changes**:

- Reorder: Distributed Training → Communication → Fault Tolerance → THEN Infrastructure
- Add brief theory section to Ch 1

---
## Questions for User

1. **Edge deployment priority**: Is adding a full edge deployment chapter to Vol I worth the extra scope? (Reddi makes a strong case for industry relevance)

2. **Responsibility integration**: Should we integrate responsible AI throughout Vol I chapters (Patterson's strong recommendation), or keep the preview approach?

3. **Page count targets**: Do you have MIT Press guidance on target page counts? This affects how much we can add.

4. **Volume II priority**: Is restructuring Vol II Part I/II ordering acceptable, or is that structure already locked?
@@ -1,207 +0,0 @@
# Machine Learning Systems: Two-Volume Structure

**Status**: Approved (Round 2 Review Complete)
**Target Publisher**: MIT Press
**Audience**: Undergraduate and graduate CS/ECE students, academic courses

---

## Overview

This textbook is being split into two volumes to serve different learning objectives:

- **Volume I: Introduction to ML Systems** - Foundational knowledge for building, optimizing, and deploying ML systems
- **Volume II: Advanced ML Systems** - Scale, distributed systems, production hardening, and responsible deployment

Each volume should stand alone as a complete learning experience while together forming a comprehensive treatment of the field.

---
## Volume I: Introduction to ML Systems

### Goal

A reader completes Volume I and can competently build, optimize, and deploy ML systems with awareness of responsible practices.

### Target Audience

- Upper-level undergraduates
- Early graduate students
- Practitioners transitioning into ML systems

### Course Mapping

- Single-semester "Introduction to ML Systems" course
- Foundation for more advanced distributed systems or MLOps courses

### Structure (14 chapters)

#### Part I: Foundations

Establish the conceptual framework for understanding ML as a systems discipline.

| Ch | Title | Purpose |
|----|-------|---------|
| 1 | Introduction to Machine Learning Systems | Why ML systems thinking matters |
| 2 | The ML Systems Landscape | Survey of the field, key components |
| 3 | Deep Learning Foundations | Mathematical and conceptual foundations |
| 4 | Modern Neural Architectures | CNNs, RNNs, Transformers, architectural choices |

#### Part II: Development

Practical skills for constructing ML systems from data to trained model.

| Ch | Title | Purpose |
|----|-------|---------|
| 5 | ML Development Workflow | End-to-end process, experimentation |
| 6 | Data Engineering for ML | Pipelines, preprocessing, data quality |
| 7 | ML Frameworks and Tools | PyTorch, TensorFlow, ecosystem |
| 8 | Training Systems | Training loops, distributed basics, debugging |

#### Part III: Optimization

Techniques for making ML systems efficient and fast.

| Ch | Title | Purpose |
|----|-------|---------|
| 9 | Efficiency in AI Systems | Why efficiency matters, metrics |
| 10 | Model Optimization Techniques | Quantization, pruning, distillation |
| 11 | Hardware Acceleration | GPUs, TPUs, custom accelerators |
| 12 | Benchmarking and Evaluation | Measuring performance, MLPerf |

#### Part IV: Operations

Getting models into production responsibly.

| Ch | Title | Purpose |
|----|-------|---------|
| 13 | ML Operations Fundamentals | Deployment, monitoring, CI/CD for ML |
| 14 | Responsible Systems Preview | Brief intro to robustness, security, fairness, sustainability (preview of Vol II topics) |

#### Volume I Conclusion

Synthesis, what was learned, bridge to Volume II.

---
## Volume II: Advanced ML Systems

### Goal

A reader completes Volume II understanding how to build and operate ML systems at scale, with production resilience and responsible practices.

### Target Audience

- Graduate students
- Industry practitioners
- Researchers building large-scale systems

### Prerequisites

- Volume I or equivalent knowledge
- Basic distributed systems concepts helpful

### Course Mapping

- Graduate seminar on large-scale ML systems
- Advanced MLOps course
- Research group reading material

### Structure (16 chapters)

#### Part I: Foundations of Scale

Infrastructure and concepts for scaling beyond single machines.

| Ch | Title | Purpose |
|----|-------|---------|
| 1 | From Single Systems to Planetary Scale | Motivation, challenges of scale |
| 2 | Infrastructure for Large-Scale ML | Clusters, cloud, resource management |
| 3 | Storage Systems for ML | Data lakes, distributed storage, checkpointing |
| 4 | Communication and Collective Operations | AllReduce, parameter servers, network topology |

#### Part II: Distributed Systems

Training and inference across multiple machines.

| Ch | Title | Purpose |
|----|-------|---------|
| 5 | Distributed Training Systems | Data parallel, model parallel, pipeline parallel |
| 6 | Fault Tolerance and Resilience | Checkpointing, recovery, handling failures |
| 7 | Inference at Scale | Serving systems, batching, latency optimization |
| 8 | Edge Intelligence Systems | Deploying ML at the edge, constraints |

#### Part III: Production Challenges

Real-world complexities of operating ML systems.

| Ch | Title | Purpose |
|----|-------|---------|
| 9 | On-Device Learning | Training on edge devices, federated learning |
| 10 | Privacy-Preserving ML Systems | Differential privacy, secure computation |
| 11 | Robust and Reliable AI | Adversarial robustness, distribution shift |
| 12 | ML Operations at Scale | Advanced MLOps, platform engineering |

#### Part IV: Responsible Deployment

Building ML systems that benefit society.

| Ch | Title | Purpose |
|----|-------|---------|
| 13 | Responsible AI Systems | Fairness, accountability, transparency |
| 14 | Sustainable AI | Environmental impact, efficient computing |
| 15 | AI for Good | Applications for societal benefit |
| 16 | Frontiers and Future Directions | Emerging trends, open problems |

#### Volume II Conclusion

Synthesis, future of the field, call to action.

---
## Key Design Decisions

### Why This Split?

1. **Pedagogical Progression**: Vol I covers what every ML practitioner needs. Vol II covers what scale/production engineers need.

2. **Course Adoptability**: Vol I maps to a single-semester intro course. Vol II maps to an advanced graduate seminar.

3. **Standalone Completeness**: A reader of only Vol I still gets responsible systems awareness through Chapter 14.

4. **Industry Alignment**: Vol I produces capable junior engineers. Vol II produces senior/staff-level systems thinkers.

### Chapter 14 in Volume I: Responsible Systems Preview

This chapter is intentionally brief (a preview, not a deep dive) covering:

- Robustness basics (models can fail)
- Security basics (models can be attacked)
- Fairness basics (models can discriminate)
- Sustainability basics (training has environmental cost)

Each topic points to the relevant Volume II chapter for deep treatment. This ensures Vol I readers are aware of these concerns without duplicating Vol II content.

### What Moves Between Volumes?

From the original single-book structure:

- On-Device Learning → Vol II (requires scale context)
- Privacy/Security → Vol II (production concern)
- Robust AI → Vol II (advanced topic)
- Responsible AI → Vol II (deep treatment)
- Sustainable AI → Vol II (deep treatment)
- AI for Good → Vol II (capstone application)
- Frontiers → Vol II (forward-looking capstone)

---
## Questions for Reviewers

1. Does Volume I stand alone as a complete, responsible introduction to ML systems?

2. Is the progression within each volume logical for students?

3. Would you adopt Volume I for an introductory ML systems course?

4. Is Chapter 14 (Responsible Systems Preview) sufficient, or should Vol I include more depth on any topic?

5. Are any chapters misplaced between volumes?

6. Is 14 chapters (Vol I) and 16 chapters (Vol II) appropriate sizing?

7. What's missing from either volume?

---
## Revision History

- **v0.1** (2024-12-31): Initial draft for review
- **v0.2** (2024-12-31): Updated Part names based on reviewer feedback
  - Volume I: Single-word Part names (Foundations, Development, Optimization, Operations)
  - Volume II: Two-word Part names (unchanged, already clear)
- **v1.0** (2024-12-31): Structure approved after Round 2 review
  - All reviewers (Patterson, Stoica, Reddi, Dean) approve chapter structure
  - Part naming convention approved
  - Ready for website implementation