Improve AI training chapter #309

Closed
opened 2026-03-22 15:36:17 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @profvjreddi on GitHub (Jan 21, 2025).

Originally assigned to: @profvjreddi on GitHub.

Need to rewrite the training chapter. The original version we have is more about general AI training with a few sprinkles of ML systems; instead, I want to rewrite the chapter to really take an ML systems perspective on training.

GiteaMirror added the area: book, type: improvement labels 2026-03-22 15:36:17 -05:00
Author
Owner

@profvjreddi commented on GitHub (Jan 21, 2025):

Tentative draft; will see how it comes out.


Chapter 8: AI Training

8.1 Introduction to AI Training

  • 8.1.1 Why Systems Matter in Training
    • Training as a system-driven process, beyond algorithmic convergence.
    • Bridging the gap between training workflows and hardware execution.
  • 8.1.2 Overview of the Training Pipeline
    • Key steps: Forward pass, backward pass, and parameter updates.
    • System considerations: Compute, memory, data flow, and communication.
  • 8.1.3 Chapter Goals
    • Understand training workflows from a systems perspective.
    • Explore system bottlenecks, hyperparameter trade-offs, and scaling optimizations.

8.2 Data and Preprocessing for Training

  • 8.2.1 Importance of Data in Training
    • Characteristics of high-quality training data: Diversity, representativeness, and balance.
    • Bias/variance trade-off: The role of data in generalization and system efficiency.
  • 8.2.2 Creating and Managing Data Splits
    • Training, validation, and test splits: Definitions and best practices.
    • Avoiding common pitfalls:
      • Data leakage between splits.
      • Failing to stratify or account for time-series dependencies.
  • 8.2.3 Data Pipelines in Training
    • Efficient data loading: Preprocessing, augmentation, caching, and shuffling.
    • System considerations:
      • Disk I/O, memory usage, and batching.
      • Batch size and its effect on data pipeline performance.
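
A minimal sketch of such an input pipeline, assuming PyTorch and torchvision are available (batch size, worker count, and normalization constants are illustrative choices, not recommendations):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # decode to a float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # standard MNIST statistics
])

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transform)

# num_workers overlaps CPU preprocessing with accelerator compute;
# pin_memory speeds host-to-device copies; shuffling decorrelates
# consecutive batches between epochs.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=4, pin_memory=True)
```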

8.3 Walkthrough: Training a Simple Network on MNIST

  • 8.3.1 Problem Setup
    • Example: Classifying handwritten digits with a feedforward network.
    • Dataset: Overview of MNIST and its relevance.
    • Model architecture: Input layer, one hidden layer, output layer.
  • 8.3.2 Step-by-Step Training Process
    • Forward pass: Computational implications of matrix multiplications.
    • Backward pass: Gradient computations and memory demands.
    • Parameter updates:
      • Optimizer choices (e.g., SGD, Adam) and their system implications.
      • Adam’s additional memory overhead compared to simpler optimizers.
    • Regularization techniques (e.g., dropout) and their impact on compute overhead.
  • 8.3.3 Key System Considerations
    • Compute: FLOPs required for forward and backward passes.
    • Memory: Activations, gradients, and parameter storage.
    • Data movement: Efficiently loading batches and minimizing I/O bottlenecks.
  • 8.3.4 Scaling the Example
    • Challenges of scaling MNIST with larger datasets or deeper networks.
    • Weight initialization (e.g., Xavier, He) and its effect on convergence speed.
    • Identifying bottlenecks in compute, memory, and data pipelines.
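
A minimal, self-contained sketch of the walkthrough above, assuming PyTorch and torchvision (layer width, learning rate, dropout rate, and epoch count are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

train_loader = DataLoader(
    datasets.MNIST("./data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Flatten(),         # 28x28 image -> 784-vector
    nn.Linear(784, 256),  # hidden layer; the 784x256 weight matrix dominates parameters
    nn.ReLU(),
    nn.Dropout(p=0.2),    # regularization: one random mask per activation
    nn.Linear(256, 10),   # output layer: one logit per digit class
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Adam would add two FP32 moment buffers per parameter, roughly
# tripling optimizer-state memory relative to plain SGD.

for epoch in range(2):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)   # forward pass: two matmuls plus activations
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()          # backward pass: gradients from stored activations
        optimizer.step()         # parameter update
```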

8.4 Training Workflows and Architectures

  • 8.4.1 Compute Infrastructure for Training
    • CPUs: General-purpose compute for preprocessing and small-scale tasks.
    • GPUs and TPUs: Accelerating deep learning through parallelism.
    • Specialized hardware: FPGAs and ASICs for ML workloads.
  • 8.4.2 Memory and Bandwidth Management
    • Memory allocation for activations, gradients, and parameters.
    • Bandwidth bottlenecks in distributed setups.
    • Strategies: Gradient checkpointing, memory reuse, and overlapping compute/data transfers (checkpointing is sketched after this section).
  • 8.4.3 Distributed Training Architectures
    • Data parallelism: Splitting datasets across devices.
    • Model parallelism: Splitting model layers across devices.
    • Pipeline parallelism: Partitioning forward and backward passes into stages.
    • Synchronization and communication trade-offs in parallel systems.
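
Two hedged sketches for this section. First, the gradient checkpointing strategy referenced in 8.4.2, using torch.utils.checkpoint: activations inside the checkpointed block are discarded after the forward pass and recomputed during backward, trading compute for memory (layer sizes are illustrative; use_reentrant=False assumes a recent PyTorch):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                      nn.Linear(1024, 1024), nn.ReLU())
head = nn.Linear(1024, 10)

x = torch.randn(32, 1024, requires_grad=True)
h = checkpoint(block, x, use_reentrant=False)  # activations recomputed in backward
loss = head(h).sum()
loss.backward()
```

Second, a minimal data-parallel skeleton with PyTorch's DistributedDataParallel; it assumes a launch via torchrun (e.g., `torchrun --nproc_per_node=4 train.py`), which sets the rank/world-size environment variables that init_process_group reads:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # NCCL for GPU-to-GPU collectives
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(784, 10).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
# Each rank trains on a different data shard; DDP all-reduces gradients
# during backward(), overlapping communication with computation.
```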

8.5 Training Optimizations from a System Perspective

8.5.1 Mixed Precision Training

  • Mixing FP16 compute with FP32 master weights for memory savings and faster computation.
  • Hardware support (e.g., GPUs, TPUs) for mixed precision training.
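
A minimal sketch of this pattern with torch.cuda.amp (model, data, and learning rate are illustrative stand-ins; assumes a CUDA GPU):

```python
import torch

model = torch.nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(128, 784, device="cuda")
y = torch.randint(0, 10, (128,), device="cuda")

with torch.cuda.amp.autocast():   # matmuls run in FP16 inside this block
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)         # unscale gradients, then update FP32 master weights
scaler.update()                # adapt the scale factor for the next step
```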

8.5.2 Parallelism Strategies

  • Data, model, and pipeline parallelism: Practical applications and limitations.
  • Interaction of hyperparameters (e.g., batch size) with parallelism:
    • Larger batches reduce synchronization costs but require more memory.

8.5.3 Batch Size and Gradient Accumulation

  • Batch size trade-offs:
    • Large batches improve hardware utilization but strain memory and I/O.
    • Small batches reduce memory needs but underutilize hardware, increasing overall training time.
  • Gradient accumulation: Simulating larger batches on memory-limited systems.
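
A minimal sketch of gradient accumulation, using synthetic stand-in batches so it runs anywhere (model and sizes are illustrative): four micro-batches of 32 approximate one batch of 128 without ever holding 128 activations at once.

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(32, 784)                   # stand-in micro-batch
    y = torch.randint(0, 10, (32,))
    loss = loss_fn(model(x), y) / accum_steps  # average across micro-batches
    loss.backward()                            # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per effective batch
        optimizer.zero_grad()
```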

8.5.4 Optimization Algorithms and System Trade-offs

  • Optimizer impacts on system performance:
    • Adam’s memory requirements vs. SGD’s lower overhead.
    • Compute costs of momentum-based optimizers.
  • System implications of learning rate schedules (e.g., warmup, decay).
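
A small illustrative experiment in PyTorch that makes the Adam-vs-SGD overhead concrete, by counting the state tensors each optimizer keeps after one step:

```python
import torch

# Count optimizer-state values for a 1M-parameter tensor after one step.
for opt_cls in (torch.optim.SGD, torch.optim.Adam):
    p = torch.nn.Parameter(torch.randn(1000, 1000))  # 1,000,000 parameters
    opt = opt_cls([p], lr=1e-3)
    p.sum().backward()
    opt.step()
    state_vals = sum(t.numel() for s in opt.state.values()
                     for t in s.values() if torch.is_tensor(t))
    print(f"{opt_cls.__name__}: {state_vals} extra state values")
# Expected: SGD (no momentum) keeps ~0; Adam keeps ~2,000,000
# (exp_avg and exp_avg_sq), i.e., two extra FP32 values per parameter.
```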

8.5.5 Communication Overhead in Distributed Training

  • Synchronizing gradients across nodes (e.g., AllReduce).
  • Techniques to minimize communication delays:
    • Overlapping compute and communication.
    • Asynchronous training for large-scale setups.
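
A minimal sketch of the AllReduce primitive itself, assuming torch.distributed with the gloo backend and a torchrun launch; DDP issues essentially this collective, per gradient bucket, during backward:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")       # gloo runs on CPU, for illustration
grad = torch.ones(4) * (dist.get_rank() + 1)  # each rank holds a different "gradient"
dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # sum across all ranks
grad /= dist.get_world_size()                 # average, as in synchronous SGD
```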

8.6 Bridging Algorithms to Hardware

8.6.1 Training Frameworks and Libraries

  • TensorFlow, PyTorch, and JAX for high-level training workflows.
  • Hardware-specific libraries: cuDNN for NVIDIA GPUs, ROCm for AMD GPUs, and oneDNN (formerly MKL-DNN) for CPUs.

8.6.2 Exploiting Hardware-Specific Features

  • GPU tensor cores and TPU optimizations for matrix multiplications.
  • Optimizer-specific impacts:
    • Adam’s memory overhead on GPUs vs. simpler optimizers like SGD.
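
A sketch of opting into tensor-core execution in PyTorch, assuming an Ampere-or-newer NVIDIA GPU (shapes are illustrative):

```python
import torch

# TF32 lets FP32 matmuls use tensor cores at reduced mantissa precision
# while keeping roughly FP32 dynamic range.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# FP16 matmuls whose dimensions are multiples of 8 map cleanly onto
# tensor cores; awkward shapes can fall back to slower kernels.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b
```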

8.6.3 Profiling and Debugging System Performance

  • Profiling tools: NVIDIA Nsight, PyTorch Profiler, TensorBoard.
  • Diagnosing bottlenecks in compute, memory, and data pipelines.
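
A minimal sketch of torch.profiler usage for spotting whether a step is dominated by compute, memory traffic, or framework overhead (model and batch are illustrative stand-ins):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(784, 10)
x = torch.randn(128, 784)

with profile(activities=[ProfilerActivity.CPU],
             profile_memory=True) as prof:
    loss = model(x).sum()
    loss.backward()

# Rank operators by time to see where the step's budget actually goes.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```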

8.7 Case Studies: Training in Practice

8.7.1 MNIST at Scale

  • Scaling a simple workload across GPUs/TPUs.
  • System lessons: Bottlenecks in data pipelines, compute, and memory.

8.7.2 Training Large Language Models

  • System challenges in transformer-based architectures.
  • Memory optimization techniques: Sparse attention and tensor sharding.

8.7.3 Efficient Training for Computer Vision

  • Data augmentation pipelines and their impact on throughput.
  • Balancing compute and memory trade-offs in convolutional models.

8.8 Conclusion

  • 8.8.1 Recap of Training as a System Process

    • Key components: Data pipelines, compute, memory, and communication.
  • 8.8.2 Mitigating System Bottlenecks

    • Strategies for optimizing hyperparameters, optimizers, and resource allocation.
  • 8.8.3 Preparing for Scalable and Efficient ML Systems

    • Linking training workflows to deployment and production systems.
Author
Owner

@profvjreddi commented on GitHub (Jan 23, 2025):

I copied in some class notes and have been pushing those updates.

Reference: github-starred/cs249r_book#309