mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-04-29 17:20:21 -05:00

Vijay Janapa Reddi bee262de3e docs: add comprehensive volume split planning documents

- Add MIT_PRESS_PROPOSAL.md: polished proposal for two-volume series
- Add VOLUME_SPLIT_ROADMAP.md: 6-month master project plan with progress tracking
- Add VOLUME_SPLIT_SURGICAL_PLAN.md: detailed section-by-section surgery instructions

These documents establish the complete roadmap for splitting the ML Systems
textbook into two flagship volumes for MIT Press publication.

2025-12-07 06:54:55 -08:00

Machine Learning Systems: Two-Volume Split - Master Roadmap

Project Start Date: December 2024
Target Completion: June 2025 (6 months)
Goal: Create flagship two-volume ML Systems textbook series for MIT Press

📋 Project Documents

Core Planning Documents

MIT_PRESS_PROPOSAL.md - Official proposal for MIT Press (ready to submit)
VOLUME_SPLIT_SURGICAL_PLAN.md - Section-by-section surgery instructions (50+ pages)
VOLUME_SPLIT_ROADMAP.md - This document: master project plan and progress tracker
VOLUME_STRUCTURE_PROPOSAL.md - Original analysis and proposal

Supporting Documents

VOLUME_SPLIT_ANALYSIS.md - Deep analysis of pedagogical issues
VOLUME_SPLIT_EXECUTIVE_SUMMARY.md - Quick reference summary
DISTRIBUTED_CONTENT_ADDITIONS.md - Distributed awareness additions for V1

🎯 Project Vision

Volume I: Introduction to ML Systems (~1,150-1,200 pages)

Complete single-system ML engineering
Introduces distributed-systems awareness without implementation detail
Target: Undergrads, bootcamps, ML engineers entering the field

Volume II: Advanced ML Systems (~1,100-1,150 pages)

Distributed systems, production scale, responsibility
Built on timeless principles
Target: Graduate students, ML infrastructure engineers, senior practitioners

The One-Liner:

"Volume I teaches you to build ML systems that work; Volume II teaches you to build ML systems that scale."

📊 Current State (December 2024)

Existing Content

Total: 2,172 pages across 22 chapters
Status: Complete draft of comprehensive single-volume book
Quality: Refined through extensive review feedback

Content Distribution

Chapters 1-14: Form basis of Volume 1 (with surgery)
Chapters 15-21: Move to Volume 2
New content needed: 325-375 pages (8 new chapters for V2)

🗓️ Six-Month Timeline

Phase 1: Chapter Surgery (Months 1-2)

Goal: Extract distributed content from 7 chapters, create clean V1

Month 1: Content Extraction

Week 1-2: Extract distributed content from Chapters 6, 7, 8
- Chapter 6 (Data Engineering): Extract 40 pages of distributed content
- Chapter 7 (Frameworks): Extract 50 pages of distributed execution
- Chapter 8 (Training): Extract 60 pages of distributed training
Week 3-4: Extract from Chapters 10, 11, 12, 13
- Chapter 10 (Optimizations): Extract 80 pages (NAS, AutoML)
- Chapter 11 (Hardware): Extract 50 pages (multi-chip)
- Chapter 12 (Benchmarking): Extract 40 pages (distributed benchmarking)
- Chapter 13 (MLOps): Extract 30 pages (production scale)

Month 2: Bridging and Polish

Week 5-6: Create V1 transitions
- Add "See Volume 2" callout boxes in V1
- Write brief distributed awareness sections
- Ensure V1 chapters remain coherent
Week 7-8: Organize extracted content
- Create V2 chapter structure
- Place extracted content in appropriate V2 chapters
- Identify gaps in extracted content

Phase 2: New Content Development (Months 3-5)

Month 3: Priority 1 Chapters (Essential Infrastructure)

Week 9-10: Memory & Storage
- V2 Ch2: Memory Hierarchies for ML (45 pages)
- V2 Ch3: Storage Systems for ML (40 pages)
Week 11-12: Communication & Distributed Training
- V2 Ch4: Communication & Collective Operations (45 pages)
- V2 Ch5: Distributed Training Systems (50 pages) - integrate extracted content

Month 4: Priority 2 Chapters (Production Requirements)

Week 13-14: Fault Tolerance & Inference
- V2 Ch6: Fault Tolerance & Resilience (40 pages)
- V2 Ch7: Inference at Scale (45 pages)
Week 15-16: Integration
- Integrate all extracted content into new chapters
- Write chapter introductions and conclusions
- Create cross-references

Month 5: Priority 3 Chapters (Specialized Topics)

Week 17-18: Edge Systems
- V2 Ch8: Edge Intelligence Systems (50 pages)
- Integrate extracted edge content from Ch2
Week 19-20: Final new chapters
- V2 Ch1: Bridge Chapter (30 pages) - From Single to Distributed
- Update the existing chapters moving to V2 (original Chapters 15-20) with new introductions

Phase 3: Integration and Polish (Month 6)

Month 6: Final Integration

Week 21-22: Cross-References and Consistency
- Update all V1→V2 cross-references
- Update all V2→V1 prerequisite references
- Ensure consistent notation across volumes
- Verify all figure references work
Week 23: Narrative Flow
- Review V1 narrative arc (Foundations → Building → Optimizing → Impact)
- Review V2 narrative arc (Scale → Production → Responsibility)
- Polish chapter transitions
- Write volume prefaces
Week 24: Final Quality Checks
- Technical accuracy review
- Page count verification
- Exercise and quiz consistency
- Final copyedit pass
- Prepare camera-ready manuscripts

📈 Progress Tracking

Volume 1 Progress (Target: 1,150-1,200 pages)

Chapter	Current	Target	Surgery Status	Notes
1. Introduction	90	60	⬜ Not Started	Compress history section
2. ML Systems	70	70	⬜ Not Started	Extract hybrid architectures
3. DL Primer	110	100	⬜ Not Started	No surgery (keep as-is)
4. DNN Architectures	82	100	⬜ Not Started	No surgery (keep as-is)
5. Workflow	51	40	⬜ Not Started	Minor compression
6. Data Engineering	138	80	⬜ Not Started	Extract distributed storage
7. Frameworks	121	100	⬜ Not Started	Extract distributed execution
8. Training	157	100	⬜ Not Started	Extract distributed training
9. Efficient AI	52	60	⬜ Not Started	No surgery (keep as-is)
10. Optimizations	160	120	⬜ Not Started	Extract NAS, AutoML
11. Hardware	181	90	⬜ Not Started	Extract multi-chip
12. Benchmarking	124	80	⬜ Not Started	Extract distributed benchmarking
13. MLOps	126	50	⬜ Not Started	Extract production scale
14. AI for Good	84	50	⬜ Not Started	Minor compression
TOTAL	1,546	1,100	0%	V1 baseline complete

Progress Key: ⬜ Not Started | 🟨 In Progress | ✅ Complete
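
The totals row can be re-derived from the per-chapter figures. The following is a minimal Python sketch, assuming the page counts are transcribed by hand from the table above; the names `V1_PAGES` and `V1_TARGET_RANGE` are illustrative and not part of any existing project tooling.

```python
# (current, target) page counts per Volume 1 chapter, transcribed from the table.
V1_PAGES = {
    "Introduction": (90, 60),
    "ML Systems": (70, 70),
    "DL Primer": (110, 100),
    "DNN Architectures": (82, 100),
    "Workflow": (51, 40),
    "Data Engineering": (138, 80),
    "Frameworks": (121, 100),
    "Training": (157, 100),
    "Efficient AI": (52, 60),
    "Optimizations": (160, 120),
    "Hardware": (181, 90),
    "Benchmarking": (124, 80),
    "MLOps": (126, 50),
    "AI for Good": (84, 50),
}

# Stated Volume 1 target range in pages (lower bound, upper bound).
V1_TARGET_RANGE = (1150, 1200)

v1_current_total = sum(cur for cur, _ in V1_PAGES.values())
v1_target_total = sum(tgt for _, tgt in V1_PAGES.values())

# Pages still missing to reach the lower end of the stated range.
v1_shortfall = max(0, V1_TARGET_RANGE[0] - v1_target_total)

print(f"current={v1_current_total} target={v1_target_total} shortfall={v1_shortfall}")
```

With the figures as listed, the per-chapter targets sum to 1,100 pages, which is 50 pages below the lower end of the stated 1,150-1,200 range, so some chapter targets may need revisiting.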

Volume 2 Progress (Target: 1,100-1,150 pages)

Chapter	Source	Target	Status	Notes
1. Bridge: Single to Distributed	NEW	30	⬜ Not Started	Write from scratch
2. Memory Hierarchies	NEW + Ch11	45	⬜ Not Started	New content + extracts
3. Storage Systems	NEW + Ch6	40	⬜ Not Started	New content + extracts
4. Communication & Collectives	NEW + Ch8	45	⬜ Not Started	New content + extracts
5. Distributed Training	Ch8 + Ch10	50	⬜ Not Started	Consolidate extracts
6. Fault Tolerance	NEW + Ch13	40	⬜ Not Started	New content + extracts
7. Inference at Scale	NEW + Ch2 + Ch13	45	⬜ Not Started	New content + extracts
8. Edge Intelligence	NEW + Ch2	50	⬜ Not Started	New content + extracts
9. On-Device Learning	Ch14 (existing)	127	⬜ Not Started	Move from V1
10. Privacy Systems	Ch15 (split)	65	⬜ Not Started	Split Privacy/Security
11. Security Systems	Ch15 (split)	68	⬜ Not Started	Split Privacy/Security
12. Robust AI	Ch16 (existing)	137	⬜ Not Started	Move from V1
13. Responsible AI	Ch17 (existing)	135	⬜ Not Started	Move from V1
14. Sustainable AI	Ch18 (existing)	46	⬜ Not Started	Move from V1
15. Frontiers & AGI	Ch19+20 (merge)	78	⬜ Not Started	Merge two chapters
TOTAL	Mixed	1,001	0%	Need ~100 more pages
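
The "Need ~100 more pages" note can be checked the same way. This is a small sketch under the assumption that the Target column above is the planned page count for each chapter; `V2_TARGETS` and `V2_TARGET_RANGE` are hypothetical names used only for illustration.

```python
# Target page counts per Volume 2 chapter, transcribed from the table.
V2_TARGETS = [
    ("Bridge: Single to Distributed", 30),
    ("Memory Hierarchies", 45),
    ("Storage Systems", 40),
    ("Communication & Collectives", 45),
    ("Distributed Training", 50),
    ("Fault Tolerance", 40),
    ("Inference at Scale", 45),
    ("Edge Intelligence", 50),
    ("On-Device Learning", 127),
    ("Privacy Systems", 65),
    ("Security Systems", 68),
    ("Robust AI", 137),
    ("Responsible AI", 135),
    ("Sustainable AI", 46),
    ("Frontiers & AGI", 78),
]

# Stated Volume 2 target range in pages (lower bound, upper bound).
V2_TARGET_RANGE = (1100, 1150)

v2_total = sum(pages for _, pages in V2_TARGETS)

# Pages still needed to reach the lower end of the stated range.
v2_gap = max(0, V2_TARGET_RANGE[0] - v2_total)

print(f"planned={v2_total} gap_to_lower_bound={v2_gap}")
```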

New Content Writing Progress (325-375 pages needed)

Chapter	Pages	Draft	Review	Final	Notes
Bridge Chapter	30	⬜	⬜	⬜	Priority 1
Memory Hierarchies	45	⬜	⬜	⬜	Priority 1
Storage Systems	40	⬜	⬜	⬜	Priority 1
Communication	45	⬜	⬜	⬜	Priority 2
Distributed Training	50	⬜	⬜	⬜	Priority 1 (+ extracts)
Fault Tolerance	40	⬜	⬜	⬜	Priority 2
Inference at Scale	45	⬜	⬜	⬜	Priority 1
Edge Intelligence	50	⬜	⬜	⬜	Priority 3
TOTAL NEW	345	0%	0%	0%	8 new chapters
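
To see how the new-content budget is spread across priorities, the table can be grouped by its priority labels. The sketch below assumes the priority shown in the Notes column of this table is authoritative; `NEW_CHAPTERS` is an illustrative name.

```python
from collections import defaultdict

# (chapter, pages, priority) for each new chapter, transcribed from the table.
NEW_CHAPTERS = [
    ("Bridge Chapter", 30, 1),
    ("Memory Hierarchies", 45, 1),
    ("Storage Systems", 40, 1),
    ("Communication", 45, 2),
    ("Distributed Training", 50, 1),
    ("Fault Tolerance", 40, 2),
    ("Inference at Scale", 45, 1),
    ("Edge Intelligence", 50, 3),
]

# Sum the planned pages under each priority label.
pages_by_priority = defaultdict(int)
for _, pages, priority in NEW_CHAPTERS:
    pages_by_priority[priority] += pages

new_total = sum(pages for _, pages, _ in NEW_CHAPTERS)

# The plan calls for 325-375 pages of new content.
new_total_in_range = 325 <= new_total <= 375

for priority in sorted(pages_by_priority):
    print(f"Priority {priority}: {pages_by_priority[priority]} pages")
print(f"Total new: {new_total} pages (within 325-375: {new_total_in_range})")
```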

📝 Weekly Progress Log

Week 1 (Dec 2-8, 2024)

Created comprehensive surgical plan
Created MIT Press proposal
Created master roadmap
Next: Begin Chapter 6 surgery

Week 2 (Dec 9-15, 2024)

Progress notes...

Week 3 (Dec 16-22, 2024)

Progress notes...

(Continue weekly log throughout 6-month period)
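
The remaining log headings follow a fixed seven-day cadence from the Week 1 start date. As a rough sketch, assuming Week 1 begins on Monday, December 2, 2024 (as the first heading indicates) and every week runs Monday through Sunday, the headings could be generated like this; `PROJECT_START`, `week_range`, and `log_heading` are hypothetical helper names.

```python
from datetime import date, timedelta

# Assumed start of Week 1, taken from the "Week 1 (Dec 2-8, 2024)" heading.
PROJECT_START = date(2024, 12, 2)


def week_range(week: int) -> tuple[date, date]:
    """Return the (first day, last day) of a 1-indexed project week."""
    if week < 1:
        raise ValueError("week numbers start at 1")
    start = PROJECT_START + timedelta(weeks=week - 1)
    return start, start + timedelta(days=6)


def log_heading(week: int) -> str:
    """Format a heading in the same style as the existing log entries."""
    start, end = week_range(week)
    if start.month == end.month:
        span = f"{start:%b} {start.day}-{end.day}"
    else:
        span = f"{start:%b} {start.day}-{end:%b} {end.day}"
    return f"Week {week} ({span}, {end.year})"


for week in (1, 2, 3, 24):
    print(log_heading(week))
```

Under this cadence, Week 24 runs May 12-18, 2025, ahead of the June 2025 target completion.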

🎯 Key Milestones

End Month 1: All distributed content extracted from V1 chapters
End Month 2: V1 chapters coherent and complete
End Month 3: Priority 1 new chapters drafted (180 pages, per the Month 3 schedule)
End Month 4: Priority 2 new chapters drafted (85 pages)
End Month 5: All new content drafted (345 pages)
End Month 6: Camera-ready manuscripts for both volumes

⚠️ Risks and Mitigation

Risk 1: Page Count Imbalance

Risk: V1 or V2 ends up significantly larger/smaller than target

Mitigation:

Monitor page counts weekly
Adjust compression/expansion as needed
Allow a flexibility margin of ±100 pages around each volume's target
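
The weekly check could be as simple as the following sketch. It assumes the ±100-page flexibility widens both ends of a volume's target range, which is one possible reading of the mitigation above; `within_tolerance` is a hypothetical helper, not existing tooling.

```python
def within_tolerance(pages: int, target_low: int, target_high: int, slack: int = 100) -> bool:
    """Return True if a page count falls inside the target range widened by `slack`."""
    return target_low - slack <= pages <= target_high + slack


# Example inputs taken from this roadmap:
# the current V1 chapters total 1,546 pages against a 1,150-1,200 target,
# and the planned V2 chapters total 1,001 pages against a 1,100-1,150 target.
v1_ok = within_tolerance(1546, 1150, 1200)
v2_ok = within_tolerance(1001, 1100, 1150)

print(f"V1 within tolerance: {v1_ok}")
print(f"V2 within tolerance: {v2_ok}")
```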

Risk 2: Missing Dependencies

Risk: V2 assumes V1 knowledge not actually covered

Mitigation:

Create prerequisite matrix
Add recap sections to V2 chapters
Review cross-references monthly

Risk 3: Timeline Slippage

Risk: New chapter writing takes longer than estimated

Mitigation:

Prioritize essential chapters first
Have backup plan to defer Priority 3 chapters
Build 2-week buffer into timeline

Risk 4: Content Duplication

Risk: Same concept explained in both volumes

Mitigation:

Clear "basic vs. advanced" delineation
V2 references V1 explicitly
Review for overlap in Month 6

📚 Reference Materials

Pedagogical Framework

V1 Narrative: Foundations → Building → Optimizing → Impact
V2 Narrative: Scale → Production → Responsibility
Connection: V1 ends with inspiration, V2 begins with bridge

Chapter Surgery Guidelines

Single-machine boundary: Keep in V1
Distributed systems: Move to V2
Production scale: Move to V2
Advanced optimization: Move to V2

Writing Standards

Timeless principles over current tech
Every chapter has: Purpose, Learning Outcomes, Summary
Concrete examples throughout
Every chapter includes a "Fallacies and Pitfalls" section

🔧 Tools and Workflow

Version Control

Create volume-split branch in Git
Track all changes in branch
Regular commits with clear messages

Organization

book/volume1/ - Volume 1 chapters
book/volume2/ - Volume 2 chapters
book/docs/ - All planning documents
book/extracted/ - Content extracted from V1
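
The directory layout above could be scaffolded with a few lines of Python. This is only a sketch: `create_layout` is a hypothetical helper, and the example runs against a temporary scratch directory rather than the real repository root.

```python
import tempfile
from pathlib import Path

# Directory layout listed in this roadmap, relative to the repository root.
LAYOUT = ("book/volume1", "book/volume2", "book/docs", "book/extracted")


def create_layout(root: Path) -> list[Path]:
    """Create each planned directory under `root` and return the resulting paths."""
    created = []
    for rel in LAYOUT:
        path = root / rel
        # parents=True creates "book/" as needed; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration against a throwaway directory (an assumption for safety).
scratch_root = Path(tempfile.mkdtemp())
created_dirs = create_layout(scratch_root)
for directory in created_dirs:
    print(directory.relative_to(scratch_root))
```

Because `exist_ok=True` is set, running the helper again on an existing checkout leaves directories that are already present untouched.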

Quality Checks

Weekly page count tracking
Monthly cross-reference review
Technical accuracy spot checks
Pedagogical flow reviews

📞 Stakeholder Communication

MIT Press Updates

Monthly: Progress report with page counts
Major milestones: Notify when phases complete
Issues: Immediate communication of risks

Community/Reviewers

End Month 2: Share V1 draft for review
End Month 4: Share V2 draft chapters for review
End Month 5: Full review cycle

✅ Final Checklist (Month 6)

Volume 1 Completion

All 14 chapters present and coherent
Page count: 1,150-1,250 pages
All cross-references to V2 marked clearly
Exercises and quizzes updated
Figures and tables numbered correctly
Bibliography complete
Index prepared

Volume 2 Completion

All 15 chapters present and coherent
Page count: 1,100-1,200 pages
Bridge chapter effective
New chapters integrate extracted content
Exercises and quizzes complete
Figures and tables numbered correctly
Bibliography complete
Index prepared

Both Volumes

Consistent notation across volumes
No content duplication
Clear prerequisite chain
Professional copyedit complete
Ready for MIT Press submission

📊 Success Metrics

Quantitative

Volume 1: 1,150-1,250 pages ✓/✗
Volume 2: 1,100-1,200 pages ✓/✗
New content: 325-375 pages ✓/✗
Timeline: 6 months ✓/✗

Qualitative

Each volume independently valuable ✓/✗
Clear pedagogical progression ✓/✗
MIT Press approval ✓/✗
Reviewer feedback positive ✓/✗

🎓 Post-Completion

Publication Process

Submit to MIT Press
Incorporate editorial feedback
Final production review
Marketing materials
Course adoption outreach

Maintenance

Errata tracking system
Annual review cycle
Community feedback integration
Future edition planning

📝 Notes and Decisions

December 2024 - Project Launch

Decision: Committed to 6-month timeline
Decision: Will do full surgery, not quick split
Decision: Flagship quality is priority over speed
Next decision needed: [track decisions here]

Last Updated: December 7, 2024
Status: Planning Complete - Ready to Begin Execution
Next Action: Begin Chapter 6 (Data Engineering) surgery - Week 1

This roadmap is the master coordination document for the two-volume split project. Update weekly with progress, decisions, and course corrections.
