Improve AI training chapter #309

Closed
opened 2026-03-22 15:36:17 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @profvjreddi on GitHub (Jan 21, 2025).

Originally assigned to: @profvjreddi on GitHub.

Need to rewrite the training chapter. The original version we have is more about general AI training with a few sprinkles of ML systems; instead, I want to rewrite the chapter to really take an ML systems perspective on training.

GiteaMirror added the area: book, type: improvement labels 2026-03-22 15:36:17 -05:00
Author
Owner

@profvjreddi commented on GitHub (Jan 21, 2025):

Tentative draft; will see how it comes out.


Chapter 8: AI Training

8.1 Introduction to AI Training

  • 8.1.1 Why Systems Matter in Training
    • Training as a system-driven process, beyond algorithmic convergence.
    • Bridging the gap between training workflows and hardware execution.
  • 8.1.2 Overview of the Training Pipeline
    • Key steps: Forward pass, backward pass, and parameter updates.
    • System considerations: Compute, memory, data flow, and communication.
  • 8.1.3 Chapter Goals
    • Understand training workflows from a systems perspective.
    • Explore system bottlenecks, hyperparameter trade-offs, and scaling optimizations.

8.2 Data and Preprocessing for Training

  • 8.2.1 Importance of Data in Training
    • Characteristics of high-quality training data: Diversity, representativeness, and balance.
    • Bias/variance trade-off: The role of data in generalization and system efficiency.
  • 8.2.2 Creating and Managing Data Splits
    • Training, validation, and test splits: Definitions and best practices.
    • Avoiding common pitfalls:
      • Data leakage between splits.
      • Failing to stratify or account for time-series dependencies.
  • 8.2.3 Data Pipelines in Training
    • Efficient data loading: Preprocessing, augmentation, caching, and shuffling.
    • System considerations:
      • Disk I/O, memory usage, and batching.
      • Batch size and its effect on data pipeline performance.
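
A minimal sketch of such an input pipeline, assuming PyTorch and torchvision are available (batch size, worker count, and normalization constants are illustrative choices, not recommendations):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # decode to a float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # standard MNIST statistics
])

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transform)

# num_workers overlaps CPU preprocessing with accelerator compute;
# pin_memory speeds host-to-device copies; shuffling decorrelates
# consecutive batches between epochs.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=4, pin_memory=True)
```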

8.3 Walkthrough: Training a Simple Network on MNIST

  • 8.3.1 Problem Setup
    • Example: Classifying handwritten digits with a feedforward network.
    • Dataset: Overview of MNIST and its relevance.
    • Model architecture: Input layer, one hidden layer, output layer.
  • 8.3.2 Step-by-Step Training Process
    • Forward pass: Computational implications of matrix multiplications.
    • Backward pass: Gradient computations and memory demands.
    • Parameter updates:
      • Optimizer choices (e.g., SGD, Adam) and their system implications.
      • Adam’s additional memory overhead compared to simpler optimizers.
    • Regularization techniques (e.g., dropout) and their impact on compute overhead.
  • 8.3.3 Key System Considerations
    • Compute: FLOPs required for forward and backward passes.
    • Memory: Activations, gradients, and parameter storage.
    • Data movement: Efficiently loading batches and minimizing I/O bottlenecks.
  • 8.3.4 Scaling the Example
    • Challenges of scaling MNIST with larger datasets or deeper networks.
    • Weight initialization (e.g., Xavier, He) and its effect on convergence speed.
    • Identifying bottlenecks in compute, memory, and data pipelines.
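
A minimal, self-contained sketch of the walkthrough above, assuming PyTorch and torchvision (layer width, learning rate, dropout rate, and epoch count are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

train_loader = DataLoader(
    datasets.MNIST("./data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Flatten(),         # 28x28 image -> 784-vector
    nn.Linear(784, 256),  # hidden layer; the 784x256 weight matrix dominates parameters
    nn.ReLU(),
    nn.Dropout(p=0.2),    # regularization: one random mask per activation
    nn.Linear(256, 10),   # output layer: one logit per digit class
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Adam would add two FP32 moment buffers per parameter, roughly
# tripling optimizer-state memory relative to plain SGD.

for epoch in range(2):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)   # forward pass: two matmuls plus activations
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()          # backward pass: gradients from stored activations
        optimizer.step()         # parameter update
```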

8.4 Training Workflows and Architectures

  • 8.4.1 Compute Infrastructure for Training
    • CPUs: General-purpose compute for preprocessing and small-scale tasks.
    • GPUs and TPUs: Accelerating deep learning through parallelism.
    • Specialized hardware: FPGAs and ASICs for ML workloads.
  • 8.4.2 Memory and Bandwidth Management
    • Memory allocation for activations, gradients, and parameters.
    • Bandwidth bottlenecks in distributed setups.
    • Strategies: Gradient checkpointing, memory reuse, and overlapping compute/data transfers (checkpointing is sketched after this section).
  • 8.4.3 Distributed Training Architectures
    • Data parallelism: Splitting datasets across devices.
    • Model parallelism: Splitting model layers across devices.
    • Pipeline parallelism: Partitioning forward and backward passes into stages.
    • Synchronization and communication trade-offs in parallel systems.
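
Two hedged sketches for this section. First, the gradient checkpointing strategy referenced in 8.4.2, using torch.utils.checkpoint: activations inside the checkpointed block are discarded after the forward pass and recomputed during backward, trading compute for memory (layer sizes are illustrative; use_reentrant=False assumes a recent PyTorch):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                      nn.Linear(1024, 1024), nn.ReLU())
head = nn.Linear(1024, 10)

x = torch.randn(32, 1024, requires_grad=True)
h = checkpoint(block, x, use_reentrant=False)  # activations recomputed in backward
loss = head(h).sum()
loss.backward()
```

Second, a minimal data-parallel skeleton with PyTorch's DistributedDataParallel; it assumes a launch via torchrun (e.g., `torchrun --nproc_per_node=4 train.py`), which sets the rank/world-size environment variables that init_process_group reads:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # NCCL for GPU-to-GPU collectives
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(784, 10).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
# Each rank trains on a different data shard; DDP all-reduces gradients
# during backward(), overlapping communication with computation.
```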

8.5 Training Optimizations from a System Perspective

8.5.1 Mixed Precision Training

  • Mixing FP16 compute with FP32 master weights for memory savings and faster computation.
  • Hardware support (e.g., GPUs, TPUs) for mixed precision training.
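
A minimal sketch of this pattern with torch.cuda.amp (model, data, and learning rate are illustrative stand-ins; assumes a CUDA GPU):

```python
import torch

model = torch.nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(128, 784, device="cuda")
y = torch.randint(0, 10, (128,), device="cuda")

with torch.cuda.amp.autocast():   # matmuls run in FP16 inside this block
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)         # unscale gradients, then update FP32 master weights
scaler.update()                # adapt the scale factor for the next step
```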

8.5.2 Parallelism Strategies

  • Data, model, and pipeline parallelism: Practical applications and limitations.
  • Interaction of hyperparameters (e.g., batch size) with parallelism:
    • Larger batches reduce synchronization costs but require more memory.

8.5.3 Batch Size and Gradient Accumulation

  • Batch size trade-offs:
    • Large batches improve hardware utilization but strain memory and I/O.
    • Small batches reduce memory needs but underutilize hardware, increasing overall training time.
  • Gradient accumulation: Simulating larger batches on memory-limited systems.
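
A minimal sketch of gradient accumulation, using synthetic stand-in batches so it runs anywhere (model and sizes are illustrative): four micro-batches of 32 approximate one batch of 128 without ever holding 128 activations at once.

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(32, 784)                   # stand-in micro-batch
    y = torch.randint(0, 10, (32,))
    loss = loss_fn(model(x), y) / accum_steps  # average across micro-batches
    loss.backward()                            # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per effective batch
        optimizer.zero_grad()
```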

8.5.4 Optimization Algorithms and System Trade-offs

  • Optimizer impacts on system performance:
    • Adam’s memory requirements vs. SGD’s lower overhead.
    • Compute costs of momentum-based optimizers.
  • System implications of learning rate schedules (e.g., warmup, decay).
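
A small illustrative experiment in PyTorch that makes the Adam-vs-SGD overhead concrete, by counting the state tensors each optimizer keeps after one step:

```python
import torch

# Count optimizer-state values for a 1M-parameter tensor after one step.
for opt_cls in (torch.optim.SGD, torch.optim.Adam):
    p = torch.nn.Parameter(torch.randn(1000, 1000))  # 1,000,000 parameters
    opt = opt_cls([p], lr=1e-3)
    p.sum().backward()
    opt.step()
    state_vals = sum(t.numel() for s in opt.state.values()
                     for t in s.values() if torch.is_tensor(t))
    print(f"{opt_cls.__name__}: {state_vals} extra state values")
# Expected: SGD (no momentum) keeps ~0; Adam keeps ~2,000,000
# (exp_avg and exp_avg_sq), i.e., two extra FP32 values per parameter.
```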

8.5.5 Communication Overhead in Distributed Training

  • Synchronizing gradients across nodes (e.g., AllReduce).
  • Techniques to minimize communication delays:
    • Overlapping compute and communication.
    • Asynchronous training for large-scale setups.
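
A minimal sketch of the AllReduce primitive itself, assuming torch.distributed with the gloo backend and a torchrun launch; DDP issues essentially this collective, per gradient bucket, during backward:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")       # gloo runs on CPU, for illustration
grad = torch.ones(4) * (dist.get_rank() + 1)  # each rank holds a different "gradient"
dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # sum across all ranks
grad /= dist.get_world_size()                 # average, as in synchronous SGD
```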

8.6 Bridging Algorithms to Hardware

8.6.1 Training Frameworks and Libraries

  • TensorFlow, PyTorch, and JAX for high-level training workflows.
  • Hardware-specific libraries: cuDNN for NVIDIA GPUs, ROCm for AMD GPUs, and oneDNN (formerly MKL-DNN) for CPUs.

8.6.2 Exploiting Hardware-Specific Features

  • GPU tensor cores and TPU optimizations for matrix multiplications.
  • Optimizer-specific impacts:
    • Adam’s memory overhead on GPUs vs. simpler optimizers like SGD.
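
A sketch of opting into tensor-core execution in PyTorch, assuming an Ampere-or-newer NVIDIA GPU (shapes are illustrative):

```python
import torch

# TF32 lets FP32 matmuls use tensor cores at reduced mantissa precision
# while keeping roughly FP32 dynamic range.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# FP16 matmuls whose dimensions are multiples of 8 map cleanly onto
# tensor cores; awkward shapes can fall back to slower kernels.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b
```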

8.6.3 Profiling and Debugging System Performance

  • Profiling tools: NVIDIA Nsight, PyTorch Profiler, TensorBoard.
  • Diagnosing bottlenecks in compute, memory, and data pipelines.
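
A minimal sketch of torch.profiler usage for spotting whether a step is dominated by compute, memory traffic, or framework overhead (model and batch are illustrative stand-ins):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(784, 10)
x = torch.randn(128, 784)

with profile(activities=[ProfilerActivity.CPU],
             profile_memory=True) as prof:
    loss = model(x).sum()
    loss.backward()

# Rank operators by time to see where the step's budget actually goes.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```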

8.7 Case Studies: Training in Practice

8.7.1 MNIST at Scale

  • Scaling a simple workload across GPUs/TPUs.
  • System lessons: Bottlenecks in data pipelines, compute, and memory.

8.7.2 Training Large Language Models

  • System challenges in transformer-based architectures.
  • Memory optimization techniques: Sparse attention and tensor sharding.

8.7.3 Efficient Training for Computer Vision

  • Data augmentation pipelines and their impact on throughput.
  • Balancing compute and memory trade-offs in convolutional models.

8.8 Conclusion

  • 8.8.1 Recap of Training as a System Process

    • Key components: Data pipelines, compute, memory, and communication.
  • 8.8.2 Mitigating System Bottlenecks

    • Strategies for optimizing hyperparameters, optimizers, and resource allocation.
  • 8.8.3 Preparing for Scalable and Efficient ML Systems

    • Linking training workflows to deployment and production systems.
Author
Owner

@profvjreddi commented on GitHub (Jan 23, 2025):

I copied in some class notes and have been pushing those updates.

Reference: github-starred/cs249r_book#309