mirror of https://github.com/harvard-edge/cs249r_book.git
synced 2026-04-30 01:29:07 -05:00

docs: add volume structure proposal and distribution design docs

- book/docs/VOLUME_STRUCTURE_PROPOSAL.md: Proposal for textbook volume structure
- tinytorch/docs/DISTRIBUTION_DESIGN.md: Design document for TinyTorch pip distribution

book/docs/VOLUME_STRUCTURE_PROPOSAL.md (214 lines, new file)

# Machine Learning Systems: Two-Volume Structure

**Proposal for MIT Press**

*Draft: December 2024*

---

## Executive Summary

The *Machine Learning Systems* textbook will be published as two complementary volumes of 14 chapters each:

| Volume | Title | Focus | Chapters |
|--------|-------|-------|----------|
| **Volume 1** | Introduction to Machine Learning Systems | Complete ML lifecycle, single-system focus | 14 (all existing) |
| **Volume 2** | Advanced Machine Learning Systems | Principles of scale, distribution, and production | 14 (6 existing, 8 new) |

**Guiding Philosophy:**

- **Volume 1**: Everything you need to build ML systems on a single machine, ending on a positive note with societal impact
- **Volume 2**: Timeless principles for operating ML systems at scale, grounded in physics and mathematics rather than current technologies

---
## Volume 1: Introduction to Machine Learning Systems

*The complete ML lifecycle: understand it, build it, optimize it, deploy it, use it for good.*

| Part | Chapter | Description |
|------|---------|-------------|
| **Part I: Systems Foundations** | | *What are ML systems?* |
| | 1. Introduction | Motivation and scope |
| | 2. ML Systems | System-level view of machine learning |
| | 3. Deep Learning Primer | Neural network fundamentals |
| | 4. DNN Architectures | Modern architecture patterns |
| **Part II: Design Principles** | | *How do you build ML systems?* |
| | 5. Workflow | End-to-end ML pipeline design |
| | 6. Data Engineering | Data collection, processing, validation |
| | 7. Frameworks | PyTorch, TensorFlow, JAX ecosystem |
| | 8. Training | Training loops, hyperparameters, convergence |
| **Part III: Performance Engineering** | | *How do you make ML systems fast?* |
| | 9. Efficient AI | Efficiency principles and metrics |
| | 10. Optimizations | Quantization, pruning, distillation |
| | 11. Hardware Acceleration | GPUs, TPUs, custom accelerators |
| | 12. Benchmarking | Measurement, MLPerf, evaluation methodology |
| **Part IV: Practice & Impact** | | *How do you deploy and use ML systems responsibly?* |
| | 13. ML Operations | Deployment, monitoring, CI/CD for ML |
| | 14. AI for Good | Positive societal applications |

**Total: 14 chapters across 4 parts (all existing content)**

### Volume 1 Narrative Arc

The book progresses from understanding → building → optimizing → deploying → impact:

1. **Foundations** establish what ML systems are and why they matter
2. **Design** teaches how to construct complete pipelines
3. **Performance** shows how to make systems efficient
4. **Practice & Impact** completes the lifecycle and ends on an inspirational note

Ending on "AI for Good" leaves students with a positive vision of what they can build.

---
## Volume 2: Advanced Machine Learning Systems

*Timeless principles for building and operating ML systems at scale.*

| Part | Chapter | Status | Description |
|------|---------|--------|-------------|
| **Part I: Data Movement & Memory** | | | *Moving data is the bottleneck* |
| | 1. Memory Hierarchies for ML | 🆕 NEW | GPU memory, HBM, activation checkpointing |
| | 2. Storage Systems for ML | 🆕 NEW | Distributed storage, checkpointing, feature stores |
| | 3. Communication & Collective Operations | 🆕 NEW | AllReduce, gradient compression, network topology |
| **Part II: Parallelism & Coordination** | | | *Decomposing computation across machines* |
| | 4. Distributed Training | 🆕 NEW | Data/model/pipeline/tensor parallelism |
| | 5. Fault Tolerance & Recovery | 🆕 NEW | Checkpointing, elastic training, failure handling |
| | 6. Inference Systems | 🆕 NEW | Batching, serving architectures, autoscaling |
| **Part III: Constrained Environments** | | | *Doing more with less* |
| | 7. On-device Learning | Existing | Training and adaptation on edge devices |
| | 8. Edge Deployment | 🆕 NEW | Compilation, runtime optimization, real-time |
| **Part IV: Adversarial Environments** | | | *Systems under attack and uncertainty* |
| | 9. Privacy in ML Systems | Existing | Differential privacy, federated learning, secure aggregation |
| | 10. Security in ML Systems | 🆕 NEW | Supply chain, API security, multi-tenant isolation |
| | 11. Robust AI | Existing | Adversarial robustness, distribution shift, monitoring |
| **Part V: Responsible Systems** | | | *Building systems that serve humanity* |
| | 12. Responsible AI | Existing | Fairness, accountability, transparency at scale |
| | 13. Sustainable AI | Existing | Energy efficiency, carbon footprint, environmental impact |
| | 14. Frontiers & Future Directions | Existing | Emerging paradigms, open problems, conclusion |

**Total: 14 chapters across 5 parts (6 existing, 8 new)**

---
## New Content for Volume 2

### Part I: Data Movement & Memory

*The physics of data movement is the fundamental constraint in modern ML.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Memory Hierarchies for ML** | GPU memory management, HBM architecture, caching strategies, activation checkpointing, memory-efficient attention | Memory bandwidth limits compute utilization |
| **Storage Systems for ML** | Distributed file systems, checkpoint I/O, feature stores, data lakes, prefetching, I/O scheduling | Storage throughput gates training speed |
| **Communication & Collective Operations** | AllReduce algorithms, ring/tree topologies, gradient compression, RDMA fundamentals, network topology design | Communication overhead limits scaling |
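The principle that memory bandwidth limits compute utilization can be made concrete with a back-of-envelope roofline estimate. The sketch below is illustrative only: the function name and the hardware figures (300 TFLOP/s peak, 2000 GB/s HBM) are placeholders chosen for round numbers, not drawn from the chapters.

```python
def attainable_tflops(peak_tflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: achievable throughput is the lesser of peak compute
    and memory bandwidth times arithmetic intensity (FLOPs per byte moved)."""
    # GB/s * FLOP/byte = GFLOP/s; divide by 1000 to express as TFLOP/s.
    memory_bound_tflops = bandwidth_gbs * flops_per_byte / 1000.0
    return min(peak_tflops, memory_bound_tflops)

# Illustrative accelerator: 300 TFLOP/s peak, 2000 GB/s HBM bandwidth.
# A low-intensity elementwise op (~0.25 FLOP/byte) is memory-bound,
# reaching only a tiny fraction of peak compute...
low = attainable_tflops(300, 2000, 0.25)   # 0.5 TFLOP/s
# ...while a high-intensity large matmul (~300 FLOP/byte) can hit peak.
high = attainable_tflops(300, 2000, 300)   # 300 TFLOP/s
```

The two-order-of-magnitude gap between the two cases is exactly why techniques such as caching, operator fusion, and memory-efficient attention belong in this part.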
### Part II: Parallelism & Coordination

*The mathematics of decomposing work across machines.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Distributed Training** | Data parallelism, model parallelism (tensor, pipeline, expert), hybrid strategies, synchronization, load balancing | Parallelism has fundamental trade-offs |
| **Fault Tolerance & Recovery** | Checkpoint strategies, async checkpointing, elastic training, failure detection, graceful degradation | Large systems fail; recovery must be designed in |
| **Inference Systems** | Batching strategies, continuous batching, KV cache management, model serving patterns, autoscaling, SLO management | Serving has different constraints than training |
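The "communication overhead limits scaling" principle from Part I also drives the parallelism trade-offs above: in a bandwidth-optimal ring AllReduce, each worker moves roughly 2(N−1)/N times the gradient buffer over its link, so the communication cost per step stays essentially flat as workers are added rather than shrinking. A minimal sketch, with illustrative numbers (1 GB of gradients, a 100 GB/s link) and a hypothetical function name:

```python
def ring_allreduce_seconds(grad_bytes, n_workers, link_gbs):
    """Bandwidth-term lower bound on ring AllReduce time: each worker
    sends/receives 2*(N-1)/N of the gradient buffer over its link
    (ignores per-step latency and overlap with compute)."""
    bytes_on_wire = 2 * (n_workers - 1) / n_workers * grad_bytes
    return bytes_on_wire / (link_gbs * 1e9)

# 1 GB of gradients over a 100 GB/s link: the time rises from ~10 ms at
# N=2 toward a ~20 ms asymptote, a fixed tax on every training step no
# matter how many workers join the ring.
for n in (2, 8, 64):
    print(n, ring_allreduce_seconds(1e9, n, 100))
```

This is the arithmetic behind data-parallel scaling limits, and it motivates the gradient-compression and topology-design topics in the Communication chapter.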
### Part III: Constrained Environments

*Operating under resource limitations.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Edge Deployment** | Model compilation, runtime optimization, heterogeneous hardware, real-time constraints, power management | Constraints force creativity |

### Part IV: Adversarial Environments

*Systems facing attacks, privacy requirements, and uncertainty.*

| Chapter | Key Topics | Timeless Principle |
|---------|------------|-------------------|
| **Security in ML Systems** | Model provenance, supply chain security, API protection, multi-tenant isolation, access control | Production systems face adversaries |

---
## Design Principles

### Why This Structure Works

**Volume 1 (Single System)**

- Teaches the complete lifecycle
- Everything can be learned and practiced on one machine
- Ends positively with societal impact

**Volume 2 (Distributed Systems)**

- Builds on Volume 1 foundations
- Addresses what changes at scale
- Organized around timeless constraints, not current technologies

### What Makes Volume 2 Timeless

Each part addresses constraints rooted in physics, mathematics, or human nature:

| Part | Eternal Constraint | Foundation |
|------|-------------------|------------|
| Data Movement & Memory | Moving data costs more than compute | Physics: speed of light, memory bandwidth |
| Parallelism & Coordination | Work must be decomposed and synchronized | Mathematics of parallel computation |
| Constrained Environments | Resources are always finite | Economics and physics |
| Adversarial Environments | Attackers and uncertainty exist | Human nature, statistics |
| Responsible Systems | Technology must serve humanity | Ethics, sustainability |

Chapters use current examples (LLMs, transformers, specific hardware) but frame them as instances of these enduring principles.

---
## Content Migration Summary

| Chapter | Volume 1 | Volume 2 | Rationale |
|---------|----------|----------|-----------|
| Introduction through Benchmarking | ✓ | | Core technical content |
| ML Operations | ✓ | | Completes the lifecycle |
| AI for Good | ✓ | | Positive conclusion |
| On-device Learning | | ✓ | Edge/constrained is advanced |
| Privacy & Security | | ✓ | Production security is advanced |
| Robust AI | | ✓ | Production robustness is advanced |
| Responsible AI | | ✓ | Scale changes the challenges |
| Sustainable AI | | ✓ | Datacenter scale is advanced |
| Frontiers | | ✓ | Conclusion for advanced volume |

---
## Audience

| Volume | Primary Audience | Use Cases |
|--------|-----------------|-----------|
| Volume 1 | All ML practitioners, undergraduates, bootcamp students | First course in ML systems, self-study |
| Volume 2 | Infrastructure engineers, graduate students, researchers | Advanced course, reference for practitioners at scale |

---
## Collaboration Model

Volume 2's new chapters are candidates for collaborative authorship:

| Topic Area | Ideal Collaborator Profile |
|------------|---------------------------|
| Memory & Storage | Datacenter architects, MLPerf Storage contributors |
| Networking & Communication | Distributed systems researchers, framework developers |
| Distributed Training | PyTorch/JAX distributed teams, hyperscaler engineers |
| Fault Tolerance | Site reliability engineers, systems researchers |
| Inference Systems | ML serving infrastructure engineers |
| Edge Deployment | Embedded ML practitioners, compiler engineers |
| Security | ML security researchers, production security engineers |

---
## Summary Statistics

| Metric | Volume 1 | Volume 2 |
|--------|----------|----------|
| Chapters | 14 | 14 |
| Parts | 4 | 5 |
| Existing content | 14 | 6 |
| New content | 0 | 8 |
| Focus | Single system | Distributed systems |
| Prerequisite | None | Volume 1 |

---

*Document Version: December 2024*

*For discussion with MIT Press and potential collaborators*