mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-09 02:11:56 -05:00
Improve AI training chapter #309
Originally created by @profvjreddi on GitHub (Jan 21, 2025).
Originally assigned to: @profvjreddi on GitHub.
The training chapter needs a rewrite. The current version is mostly about general AI training, with only a sprinkling of ML systems content. I want to rewrite it so that it takes an ML systems perspective on training.
@profvjreddi commented on GitHub (Jan 21, 2025):
Tentative draft outline. We'll see how it comes out.
Chapter 8: AI Training
8.1 Introduction to AI Training
8.2 Data and Preprocessing for Training
8.3 Walkthrough: Training a Simple Network on MNIST
8.4 Training Workflows and Architectures
8.5 Training Optimizations from a System Perspective
8.5.1 Mixed Precision Training
8.5.2 Parallelism Strategies
8.5.3 Batch Size and Gradient Accumulation
8.5.4 Optimization Algorithms and System Trade-offs
8.5.5 Communication Overhead in Distributed Training
8.6 Bridging Algorithms to Hardware
8.6.1 Training Frameworks and Libraries
8.6.2 Exploiting Hardware-Specific Features
8.6.3 Profiling and Debugging System Performance
8.7 Case Studies: Training in Practice
8.7.1 Training MNIST at Scale
8.7.2 Training Large Language Models
8.7.3 Efficient Training for Computer Vision
8.8 Conclusion
8.8.1 Recap of Training as a System Process
8.8.2 Mitigating System Bottlenecks
8.8.3 Preparing for Scalable and Efficient ML Systems
@profvjreddi commented on GitHub (Jan 23, 2025):
I copied in some class notes and have been pushing those updates.