cs249r_book/book/docs/VOLUME_SPLIT_ROADMAP.md
Vijay Janapa Reddi bee262de3e docs: add comprehensive volume split planning documents
- Add MIT_PRESS_PROPOSAL.md: polished proposal for two-volume series
- Add VOLUME_SPLIT_ROADMAP.md: 6-month master project plan with progress tracking
- Add VOLUME_SPLIT_SURGICAL_PLAN.md: detailed section-by-section surgery instructions

These documents establish the complete roadmap for splitting the ML Systems
textbook into two flagship volumes for MIT Press publication.
2025-12-07 06:54:55 -08:00


Machine Learning Systems: Two-Volume Split - Master Roadmap

Project Start Date: December 2024
Target Completion: June 2025 (6 months)
Goal: Create flagship two-volume ML Systems textbook series for MIT Press


📋 Project Documents

Core Planning Documents

  1. MIT_PRESS_PROPOSAL.md - Official proposal for MIT Press (ready to submit)
  2. VOLUME_SPLIT_SURGICAL_PLAN.md - Section-by-section surgery instructions (50+ pages)
  3. VOLUME_SPLIT_ROADMAP.md - This document: master project plan and progress tracker
  4. VOLUME_STRUCTURE_PROPOSAL.md - Original analysis and proposal

Supporting Documents

  • VOLUME_SPLIT_ANALYSIS.md - Deep analysis of pedagogical issues
  • VOLUME_SPLIT_EXECUTIVE_SUMMARY.md - Quick reference summary
  • DISTRIBUTED_CONTENT_ADDITIONS.md - Distributed awareness additions for V1

🎯 Project Vision

Volume I: Introduction to ML Systems (~1,150-1,200 pages)

  • Complete single-system ML engineering
  • Includes distributed awareness (not implementation)
  • Target: Undergrads, bootcamps, ML engineers entering the field

Volume II: Advanced ML Systems (~1,100-1,150 pages)

  • Distributed systems, production scale, responsibility
  • Built on timeless principles
  • Target: Graduate students, ML infrastructure engineers, senior practitioners

The One-Liner:

"Volume I teaches you to build ML systems that work; Volume II teaches you to build ML systems that scale."


📊 Current State (December 2024)

Existing Content

  • Total pages: 2,172 pages across 22 chapters
  • Status: Complete draft of comprehensive single-volume book
  • Quality: Refined through extensive review feedback

Content Distribution

  • Chapters 1-14: Form basis of Volume 1 (with surgery)
  • Chapters 15-21: Move to Volume 2
  • New content needed: 325-375 pages (8 new chapters for V2)

🗓️ Six-Month Timeline

Phase 1: Chapter Surgery (Months 1-2)

Goal: Extract distributed content from 7 chapters, create clean V1

Month 1: Content Extraction

  • Week 1-2: Extract distributed content from Chapters 6, 7, 8

    • Chapter 6 (Data Engineering): Extract 40 pages of distributed content
    • Chapter 7 (Frameworks): Extract 50 pages of distributed execution
    • Chapter 8 (Training): Extract 60 pages of distributed training
  • Week 3-4: Extract from Chapters 10, 11, 12, 13

    • Chapter 10 (Optimizations): Extract 80 pages (NAS, AutoML)
    • Chapter 11 (Hardware): Extract 50 pages (multi-chip)
    • Chapter 12 (Benchmarking): Extract 40 pages (distributed benchmarking)
    • Chapter 13 (MLOps): Extract 30 pages (production scale)

Month 2: Bridging and Polish

  • Week 5-6: Create V1 transitions

    • Add "See Volume 2" callout boxes in V1
    • Write brief distributed awareness sections
    • Ensure V1 chapters remain coherent
  • Week 7-8: Organize extracted content

    • Create V2 chapter structure
    • Place extracted content in appropriate V2 chapters
    • Identify gaps in extracted content

Phase 2: New Content Development (Months 3-5)

Month 3: Priority 1 Chapters (Essential Infrastructure)

  • Week 9-10: Memory & Storage

    • V2 Ch2: Memory Hierarchies for ML (45 pages)
    • V2 Ch3: Storage Systems for ML (40 pages)
  • Week 11-12: Communication & Distributed Training

    • V2 Ch4: Communication & Collective Operations (45 pages)
    • V2 Ch5: Distributed Training Systems (50 pages) - integrate extracted content

Month 4: Priority 2 Chapters (Production Requirements)

  • Week 13-14: Fault Tolerance & Inference

    • V2 Ch6: Fault Tolerance & Resilience (40 pages)
    • V2 Ch7: Inference at Scale (45 pages)
  • Week 15-16: Integration

    • Integrate all extracted content into new chapters
    • Write chapter introductions and conclusions
    • Create cross-references

Month 5: Priority 3 Chapters (Specialized Topics)

  • Week 17-18: Edge Systems

    • V2 Ch8: Edge Intelligence Systems (50 pages)
    • Integrate extracted edge content from Ch2
  • Week 19-20: Final new chapters

    • V2 Ch1: Bridge Chapter (30 pages) - From Single to Distributed
    • Update existing V2 chapters (15-20) with introductions

Phase 3: Integration and Polish (Month 6)

Month 6: Final Integration

  • Week 21-22: Cross-References and Consistency

    • Update all V1→V2 cross-references
    • Update all V2→V1 prerequisite references
    • Ensure consistent notation across volumes
    • Verify all figure references work
  • Week 23: Narrative Flow

    • Review V1 narrative arc (Foundations → Building → Optimizing → Impact)
    • Review V2 narrative arc (Scale → Production → Responsibility)
    • Polish chapter transitions
    • Write volume prefaces
  • Week 24: Final Quality Checks

    • Technical accuracy review
    • Page count verification
    • Exercise and quiz consistency
    • Final copyedit pass
    • Prepare camera-ready manuscripts

📈 Progress Tracking

Volume 1 Progress (Target: 1,150-1,200 pages)

| Chapter | Current | Target | Surgery Status | Notes |
|---|---|---|---|---|
| 1. Introduction | 90 | 60 | Not Started | Compress history section |
| 2. ML Systems | 70 | 70 | Not Started | Extract hybrid architectures |
| 3. DL Primer | 110 | 100 | Not Started | No surgery (keep as-is) |
| 4. DNN Architectures | 82 | 100 | Not Started | No surgery (keep as-is) |
| 5. Workflow | 51 | 40 | Not Started | Minor compression |
| 6. Data Engineering | 138 | 80 | Not Started | Extract distributed storage |
| 7. Frameworks | 121 | 100 | Not Started | Extract distributed execution |
| 8. Training | 157 | 100 | Not Started | Extract distributed training |
| 9. Efficient AI | 52 | 60 | Not Started | No surgery (keep as-is) |
| 10. Optimizations | 160 | 120 | Not Started | Extract NAS, AutoML |
| 11. Hardware | 181 | 90 | Not Started | Extract multi-chip |
| 12. Benchmarking | 124 | 80 | Not Started | Extract distributed benchmarking |
| 13. MLOps | 126 | 50 | Not Started | Extract production scale |
| 14. AI for Good | 84 | 50 | Not Started | Minor compression |
| TOTAL | 1,546 | 1,100 | 0% | V1 baseline complete |

Progress Key: Not Started | 🟨 In Progress | Complete
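
The totals in the Volume 1 table can be cross-checked with a few lines of Python. This is only an illustrative sketch: the page counts are copied from the table, while the `page_budget` helper and the variable names are assumptions introduced here, not part of any existing project tooling.

```python
# Page counts copied from the Volume 1 progress table: (chapter, current, target).
V1_CHAPTERS = [
    ("Introduction", 90, 60),
    ("ML Systems", 70, 70),
    ("DL Primer", 110, 100),
    ("DNN Architectures", 82, 100),
    ("Workflow", 51, 40),
    ("Data Engineering", 138, 80),
    ("Frameworks", 121, 100),
    ("Training", 157, 100),
    ("Efficient AI", 52, 60),
    ("Optimizations", 160, 120),
    ("Hardware", 181, 90),
    ("Benchmarking", 124, 80),
    ("MLOps", 126, 50),
    ("AI for Good", 84, 50),
]


def page_budget(chapters):
    """Return (current_total, target_total, net_change) for a chapter list."""
    current = sum(cur for _, cur, _ in chapters)
    target = sum(tgt for _, _, tgt in chapters)
    return current, target, target - current


current, target, net = page_budget(V1_CHAPTERS)
print(f"current={current} target={target} net={net:+d}")

# Sort so the chapters that must shed the most pages come first.
cuts = sorted(V1_CHAPTERS, key=lambda row: row[2] - row[1])
for name, cur, tgt in cuts[:3]:
    print(f"{name}: {cur} -> {tgt} ({tgt - cur:+d})")
```

With the table's numbers this reports a net reduction of 446 pages (1,546 to 1,100), with Hardware, MLOps, and Data Engineering as the three largest cuts.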

Volume 2 Progress (Target: 1,100-1,150 pages)

| Chapter | Source | Target | Status | Notes |
|---|---|---|---|---|
| 1. Bridge: Single to Distributed | NEW | 30 | Not Started | Write from scratch |
| 2. Memory Hierarchies | NEW + Ch11 | 45 | Not Started | New content + extracts |
| 3. Storage Systems | NEW + Ch6 | 40 | Not Started | New content + extracts |
| 4. Communication & Collectives | NEW + Ch8 | 45 | Not Started | New content + extracts |
| 5. Distributed Training | Ch8 + Ch10 | 50 | Not Started | Consolidate extracts |
| 6. Fault Tolerance | NEW + Ch13 | 40 | Not Started | New content + extracts |
| 7. Inference at Scale | NEW + Ch2 + Ch13 | 45 | Not Started | New content + extracts |
| 8. Edge Intelligence | NEW + Ch2 | 50 | Not Started | New content + extracts |
| 9. On-Device Learning | Ch14 (existing) | 127 | Not Started | Move from V1 |
| 10. Privacy Systems | Ch15 (split) | 65 | Not Started | Split Privacy/Security |
| 11. Security Systems | Ch15 (split) | 68 | Not Started | Split Privacy/Security |
| 12. Robust AI | Ch16 (existing) | 137 | Not Started | Move from V1 |
| 13. Responsible AI | Ch17 (existing) | 135 | Not Started | Move from V1 |
| 14. Sustainable AI | Ch18 (existing) | 46 | Not Started | Move from V1 |
| 15. Frontiers & AGI | Ch19+20 (merge) | 78 | Not Started | Merge two chapters |
| TOTAL | Mixed | 1,001 | 0% | Need ~100 more pages |
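
The "Need ~100 more pages" note follows from comparing the planned total with the 1,100-1,150 page target. A minimal sketch of that arithmetic is shown here; the page targets are taken from the table, and the `shortfall` helper is a hypothetical name used only for illustration.

```python
# Per-chapter page targets for Volume 2, in table order (chapters 1-15).
V2_TARGET_PAGES = [30, 45, 40, 45, 50, 40, 45, 50, 127, 65, 68, 137, 135, 46, 78]

# Volume 2 target range stated in this roadmap.
V2_TARGET_RANGE = (1100, 1150)


def shortfall(planned, target_range):
    """Return (pages needed to reach the low end, pages needed to reach the high end)."""
    low, high = target_range
    return max(0, low - planned), max(0, high - planned)


planned = sum(V2_TARGET_PAGES)
min_gap, max_gap = shortfall(planned, V2_TARGET_RANGE)
print(f"planned={planned}, need {min_gap}-{max_gap} more pages")
```

With the planned total of 1,001 pages, the gap is 99 pages to the low end of the range and 149 pages to the high end.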

New Content Writing Progress (325-375 pages needed)

| Chapter | Pages | Draft | Review | Final | Notes |
|---|---|---|---|---|---|
| Bridge Chapter | 30 | | | | Priority 1 |
| Memory Hierarchies | 45 | | | | Priority 1 |
| Storage Systems | 40 | | | | Priority 1 |
| Communication | 45 | | | | Priority 2 |
| Distributed Training | 50 | | | | Priority 1 (+ extracts) |
| Fault Tolerance | 40 | | | | Priority 2 |
| Inference at Scale | 45 | | | | Priority 1 |
| Edge Intelligence | 50 | | | | Priority 3 |
| TOTAL NEW | 345 | 0% | 0% | 0% | 8 new chapters |

📝 Weekly Progress Log

Week 1 (Dec 2-8, 2024)

  • Created comprehensive surgical plan
  • Created MIT Press proposal
  • Created master roadmap
  • Next: Begin Chapter 6 surgery

Week 2 (Dec 9-15, 2024)

  • Progress notes...

Week 3 (Dec 16-22, 2024)

  • Progress notes...

(Continue weekly log throughout 6-month period)
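
The remaining log headings can be generated rather than typed by hand. The sketch below assumes Week 1 begins on Monday, December 2, 2024 (as in the log above) and that the plan runs for the 24 weeks listed in the timeline; the helper names `week_range` and `heading` are illustrative only.

```python
from datetime import date, timedelta

PROJECT_START = date(2024, 12, 2)  # first day of Week 1, per the log above
TOTAL_WEEKS = 24                   # the timeline runs through Week 24


def week_range(week, start=PROJECT_START):
    """Return (first_day, last_day) of the given 1-based project week."""
    if not 1 <= week <= TOTAL_WEEKS:
        raise ValueError(f"week must be between 1 and {TOTAL_WEEKS}")
    first = start + timedelta(weeks=week - 1)
    return first, first + timedelta(days=6)


def heading(week):
    """Format a log heading in the same style as 'Week 1 (Dec 2-8, 2024)'."""
    first, last = week_range(week)
    if first.month == last.month:
        span = f"{first:%b} {first.day}-{last.day}, {last.year}"
    else:
        span = f"{first:%b} {first.day}-{last:%b} {last.day}, {last.year}"
    return f"Week {week} ({span})"


for week in (1, 2, 3, TOTAL_WEEKS):
    print(heading(week))
```

Under these assumptions, Week 24 covers May 12-18, 2025, so the 24 scheduled weeks end before the June 2025 target completion.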


🎯 Key Milestones

  • End Month 1: All distributed content extracted from V1 chapters
  • End Month 2: V1 chapters coherent and complete
  • End Month 3: Priority 1 new chapters drafted (180 pages)
  • End Month 4: Priority 2 new chapters drafted (85 pages)
  • End Month 5: All new content drafted (345 pages)
  • End Month 6: Camera-ready manuscripts for both volumes

⚠️ Risks and Mitigation

Risk 1: Page Count Imbalance

Risk: V1 or V2 ends up significantly larger/smaller than target

Mitigation:

  • Monitor page counts weekly
  • Adjust compression/expansion as needed
  • Allow a flexibility margin of ±100 pages around each target
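
One way to make the weekly page-count check mechanical is sketched below. It assumes the ±100 page margin is applied outside both ends of a volume's target range; that interpretation, and the `check_page_count` helper itself, are assumptions made for illustration.

```python
def check_page_count(actual, target_range, margin=100):
    """Classify a page count as 'under', 'ok', or 'over' relative to a target range.

    The margin widens the acceptable window on both sides of the range.
    """
    low, high = target_range
    if actual < low - margin:
        return "under"
    if actual > high + margin:
        return "over"
    return "ok"


V1_RANGE = (1150, 1200)  # Volume 1 target from this roadmap
V2_RANGE = (1100, 1150)  # Volume 2 target from this roadmap

print(check_page_count(1546, V1_RANGE))  # current V1 chapter total
print(check_page_count(1100, V1_RANGE))  # sum of V1 per-chapter targets
print(check_page_count(1001, V2_RANGE))  # sum of V2 per-chapter targets
```

Under this reading, the current 1,546-page Volume 1 draft is flagged as over, while both sets of per-chapter targets fall inside the widened window.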

Risk 2: Missing Dependencies

Risk: V2 assumes V1 knowledge not actually covered

Mitigation:

  • Create prerequisite matrix
  • Add recap sections to V2 chapters
  • Review cross-references monthly
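
A prerequisite matrix can start as a simple mapping from each V2 chapter to the V1 chapters it assumes. The sketch below seeds the mapping from the Source column of the Volume 2 table, treating source chapters as prerequisites, which is an assumption. The "Networking Basics" entry is deliberately hypothetical and is there only to show how a gap would be reported.

```python
# Volume 1 chapter titles, from the Volume 1 progress table.
V1_CHAPTERS = {
    "Introduction", "ML Systems", "DL Primer", "DNN Architectures", "Workflow",
    "Data Engineering", "Frameworks", "Training", "Efficient AI", "Optimizations",
    "Hardware", "Benchmarking", "MLOps", "AI for Good",
}

# ASSUMED prerequisite matrix: V2 chapter -> V1 chapters it relies on.
PREREQS = {
    "Storage Systems": {"Data Engineering"},
    "Distributed Training": {"Training", "Optimizations"},
    "Fault Tolerance": {"MLOps"},
    "Inference at Scale": {"ML Systems", "MLOps"},
    # Hypothetical entry naming a topic V1 does not cover, to demonstrate a gap.
    "Communication & Collectives": {"Training", "Networking Basics"},
}


def missing_prerequisites(prereqs, available):
    """Return {v2_chapter: [prerequisites not found in V1]} for chapters with gaps."""
    gaps = {}
    for chapter, needed in prereqs.items():
        absent = sorted(needed - available)
        if absent:
            gaps[chapter] = absent
    return gaps


print(missing_prerequisites(PREREQS, V1_CHAPTERS))
```

Any chapter that appears in the result either needs a recap section in V2 or a corresponding addition to V1.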

Risk 3: Timeline Slippage

Risk: New chapter writing takes longer than estimated

Mitigation:

  • Prioritize essential chapters first
  • Have backup plan to defer Priority 3 chapters
  • Build 2-week buffer into timeline

Risk 4: Content Duplication

Risk: Same concept explained in both volumes

Mitigation:

  • Clear "basic vs. advanced" delineation
  • V2 references V1 explicitly
  • Review for overlap in Month 6

📚 Reference Materials

Pedagogical Framework

  • V1 Narrative: Foundations → Building → Optimizing → Impact
  • V2 Narrative: Scale → Production → Responsibility
  • Connection: V1 ends with inspiration, V2 begins with bridge

Chapter Surgery Guidelines

  • Single-machine boundary: Keep in V1
  • Distributed systems: Move to V2
  • Production scale: Move to V2
  • Advanced optimization: Move to V2

Writing Standards

  • Timeless principles over current tech
  • Every chapter has: Purpose, Learning Outcomes, Summary
  • Concrete examples throughout
  • "Fallacies and Pitfalls" section

🔧 Tools and Workflow

Version Control

  • Create volume-split branch in Git
  • Track all changes in branch
  • Regular commits with clear messages

Organization

  • book/volume1/ - Volume 1 chapters
  • book/volume2/ - Volume 2 chapters
  • book/docs/ - All planning documents
  • book/extracted/ - Content extracted from V1
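
The directory layout above could be scaffolded with a short script. This is a sketch only: the four paths come from the list above, the `create_layout` helper is a hypothetical name, and the example runs against a temporary directory rather than the real repository.

```python
import tempfile
from pathlib import Path

# Directories named in the Organization list, with their stated purpose.
LAYOUT = {
    "book/volume1": "Volume 1 chapters",
    "book/volume2": "Volume 2 chapters",
    "book/docs": "All planning documents",
    "book/extracted": "Content extracted from V1",
}


def create_layout(root):
    """Create each layout directory under root and return the created paths."""
    created = []
    for relative in LAYOUT:
        path = Path(root) / relative
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Demonstrate against a throwaway directory instead of the actual repo.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_layout(tmp):
        print(path.relative_to(tmp), "-", LAYOUT[path.relative_to(tmp).as_posix()])
```

Because `mkdir` is called with `exist_ok=True`, running the script again on an existing checkout leaves directories that are already present untouched.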

Quality Checks

  • Weekly page count tracking
  • Monthly cross-reference review
  • Technical accuracy spot checks
  • Pedagogical flow reviews

📞 Stakeholder Communication

MIT Press Updates

  • Monthly: Progress report with page counts
  • Major milestones: Notify when phases complete
  • Issues: Immediate communication of risks

Community/Reviewers

  • End Month 2: Share V1 draft for review
  • End Month 4: Share V2 draft chapters for review
  • End Month 5: Full review cycle

Final Checklist (Month 6)

Volume 1 Completion

  • All 14 chapters present and coherent
  • Page count: 1,150-1,250 pages
  • All cross-references to V2 marked clearly
  • Exercises and quizzes updated
  • Figures and tables numbered correctly
  • Bibliography complete
  • Index prepared

Volume 2 Completion

  • All 15 chapters present and coherent
  • Page count: 1,100-1,200 pages
  • Bridge chapter effective
  • New chapters integrate extracted content
  • Exercises and quizzes complete
  • Figures and tables numbered correctly
  • Bibliography complete
  • Index prepared

Both Volumes

  • Consistent notation across volumes
  • No content duplication
  • Clear prerequisite chain
  • Professional copyedit complete
  • Ready for MIT Press submission

📊 Success Metrics

Quantitative

  • Volume 1: 1,150-1,250 pages ✓/✗
  • Volume 2: 1,100-1,200 pages ✓/✗
  • New content: 325-375 pages ✓/✗
  • Timeline: 6 months ✓/✗

Qualitative

  • Each volume independently valuable ✓/✗
  • Clear pedagogical progression ✓/✗
  • MIT Press approval ✓/✗
  • Reviewer feedback positive ✓/✗

🎓 Post-Completion

Publication Process

  • Submit to MIT Press
  • Incorporate editorial feedback
  • Final production review
  • Marketing materials
  • Course adoption outreach

Maintenance

  • Errata tracking system
  • Annual review cycle
  • Community feedback integration
  • Future edition planning

📝 Notes and Decisions

December 2024 - Project Launch

  • Decision: Committed to 6-month timeline
  • Decision: Will do full surgery, not quick split
  • Decision: Flagship quality is priority over speed
  • Next decision needed: [track decisions here]

Last Updated: December 7, 2024
Status: Planning Complete - Ready to Begin Execution
Next Action: Begin Chapter 6 (Data Engineering) surgery - Week 1


This roadmap is the master coordination document for the two-volume split project. Update weekly with progress, decisions, and course corrections.