mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-09 07:15:51 -05:00
refactor: anchor Volume 2 Distributed Training math to the Frontier Mission archetype; unify Vol 1 and Vol 2 logic
@@ -712,22 +712,27 @@ Modern distributed training frameworks handle this distribution automatically th
 # │ Imports: mlsys.constants (GPT3_PARAMS, param, BILLION)
 # │ Exports: gpt3_params_b
 # └─────────────────────────────────────────────────────────────────────────────
-from mlsys.constants import GPT3_PARAMS, param, BILLION
+from mlsys import Models, Applications
+from mlsys.constants import param, BILLION
+from mlsys.formatting import fmt

-class Gpt3TrainingContext:
+class FrontierTrainingContext:
     """GPT-3 scale reference for distributed training."""

     # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
-    params = GPT3_PARAMS.m_as(param)
+    mission = Applications.Frontier
+    model = mission.model

     # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
-    params_b = params / BILLION
+    params_b = model.parameters.m_as(param) / BILLION

     # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
-    gpt3_params_b = f"{params_b:.0f}"
+    frontier_params_b_str = fmt(params_b, precision=0)
+    frontier_name = model.name

 # ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
-gpt3_params_b = Gpt3TrainingContext.gpt3_params_b
+frontier_params_b = FrontierTrainingContext.frontier_params_b_str
+frontier_name = FrontierTrainingContext.frontier_name
 ```

 #### Compute Phase: Forward and Backward Passes {#sec-distributed-training-systems-systems-compute-phase-forward-backward}
@@ -747,27 +752,32 @@ gpt3_params_b = Gpt3TrainingContext.gpt3_params_b
 # │ Imports: mlsys.constants (GPT3_PARAMS, param, BILLION)
 # │ Exports: gpt3_params_b
 # └─────────────────────────────────────────────────────────────────────────────
-from mlsys.constants import GPT3_PARAMS, param, BILLION
+from mlsys import Models, Applications
+from mlsys.constants import param, BILLION
+from mlsys.formatting import fmt

-class Gpt3TrainingContext:
+class FrontierTrainingContext:
     """GPT-3 scale reference for distributed training."""

     # ┌── 1. LOAD (Constants) ──────────────────────────────────────────────
-    params = GPT3_PARAMS.m_as(param)
+    mission = Applications.Frontier
+    model = mission.model

     # ┌── 2. EXECUTE (The Compute) ────────────────────────────────────────
-    params_b = params / BILLION
+    params_b = model.parameters.m_as(param) / BILLION

     # ┌── 4. OUTPUT (Formatting) ──────────────────────────────────────────────
-    gpt3_params_b = f"{params_b:.0f}"
+    frontier_params_b_str = fmt(params_b, precision=0)
+    frontier_name = model.name

 # ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
-gpt3_params_b = Gpt3TrainingContext.gpt3_params_b
+frontier_params_b = FrontierTrainingContext.frontier_params_b_str
+frontier_name = FrontierTrainingContext.frontier_name
 ```

-The defining feature of data parallelism is that the computation phase—both forward and backward—is **embarrassingly parallel**. Each GPU operates as an isolated island, executing an identical copy of the model on a unique micro-batch of data. For our `{python} gpt3_params_b`B parameter reference model, this isolation is critical: during the forward pass, each GPU independently computes activations for its local batch (micro-batch size 4, sequence length 2048). Without optimization, storing these activations for backpropagation would consume over 200 GB of HBM, exceeding the capacity of even an H100 GPU; techniques like **activation checkpointing**—recomputing activations during the backward pass rather than storing them—are mandatory to suppress this footprint to a manageable ~50 GB.
+The defining feature of data parallelism is that the computation phase—both forward and backward—is **embarrassingly parallel**. Each GPU operates as an isolated island, executing an identical copy of the model on a unique micro-batch of data. For our `{python} frontier_params_b`B parameter reference model, this isolation is critical: during the forward pass, each GPU independently computes activations for its local batch (micro-batch size 4, sequence length 2048). Without optimization, storing these activations for backpropagation would consume over 200 GB of HBM, exceeding the capacity of even an H100 GPU; techniques like **activation checkpointing**—recomputing activations during the backward pass rather than storing them—are mandatory to suppress this footprint to a manageable ~50 GB.

-The backward pass mirrors this independence but introduces the system's primary bottleneck. As the GPU traverses the computation graph in reverse, it computes gradients for every parameter in the model. For a `{python} gpt3_params_b`B model in FP16, this generates a 350 GB gradient payload per GPU. While the computation itself requires zero communication, the resulting gradients represent a fractured view of the true loss surface—valid only for the local micro-batch. Before the optimizer step can occur, these local gradients must be aggregated across all $N$ GPUs to form a valid global gradient. This transition—from isolated, high-throughput compute to a massive, global synchronization event—defines the rhythm of data parallel training: long periods of silent, intense arithmetic punctuated by bursts of heavy network traffic.
+The backward pass mirrors this independence but introduces the system's primary bottleneck. As the GPU traverses the computation graph in reverse, it computes gradients for every parameter in the model. For a `{python} frontier_params_b`B model in FP16, this generates a 350 GB gradient payload per GPU. While the computation itself requires zero communication, the resulting gradients represent a fractured view of the true loss surface—valid only for the local micro-batch. Before the optimizer step can occur, these local gradients must be aggregated across all $N$ GPUs to form a valid global gradient. This transition—from isolated, high-throughput compute to a massive, global synchronization event—defines the rhythm of data parallel training: long periods of silent, intense arithmetic punctuated by bursts of heavy network traffic.

 #### Gradient Synchronization {#sec-distributed-training-systems-systems-gradient-synchronization-614b}

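The payload figure in the revised paragraph above is plain arithmetic and can be sanity-checked in isolation. This is a standalone sketch; the 175-billion-parameter count and 2-byte FP16 width are assumptions drawn from the GPT-3-scale running example, not values imported from the book's `mlsys` package.

```python
# Back-of-the-envelope check of the per-GPU FP16 gradient payload.
PARAMS = 175e9       # GPT-3-scale parameter count (assumed)
FP16_BYTES = 2       # bytes per parameter in FP16

gradient_payload_gb = PARAMS * FP16_BYTES / 1e9
print(f"Gradient payload per GPU: {gradient_payload_gb:.0f} GB")  # 350 GB
```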
@@ -935,7 +945,7 @@ Data parallelism is the default strategy for a reason: it scales **throughput**

 Data parallelism offers three principal advantages. First, throughput scales linearly for compute-bound models: scaling ResNet-50 on ImageNet from 1 to 256 GPUs yields near-linear speedup because the gradient exchange is small relative to the compute time. Second, the model architecture remains unchanged; the framework wraps the model in a data-parallel container that intercepts backward-pass hooks to trigger gradient synchronization automatically. Third, utilization remains high because, unlike model parallelism, there are no pipeline bubbles — all GPUs work on the forward and backward pass simultaneously.

-These advantages, however, encounter three hard ceilings. The memory wall requires every GPU to hold a full copy of the model parameters, gradients, and optimizer states; for a `{python} gpt3_params_b`B parameter model, this demands more than 1 TB of memory per GPU, which is physically impossible on current hardware without ZeRO sharding. The bandwidth wall emerges as $N$ grows: the AllReduce cost $2(N-1)/N \times M/B$ eventually dominates, and for large language models gradient synchronization can consume more than 50% of the step time, collapsing efficiency. The batch size trap compounds the problem: scaling to thousands of GPUs requires increasing the global batch size ($B_{global} = N \times B_{local}$), and eventually the **Critical Batch Size** is reached, where adding more data per step yields diminishing returns in convergence.
+These advantages, however, encounter three hard ceilings. The memory wall requires every GPU to hold a full copy of the model parameters, gradients, and optimizer states; for a `{python} frontier_params_b`B parameter model, this demands more than 1 TB of memory per GPU, which is physically impossible on current hardware without ZeRO sharding. The bandwidth wall emerges as $N$ grows: the AllReduce cost $2(N-1)/N \times M/B$ eventually dominates, and for large language models gradient synchronization can consume more than 50% of the step time, collapsing efficiency. The batch size trap compounds the problem: scaling to thousands of GPUs requires increasing the global batch size ($B_{global} = N \times B_{local}$), and eventually the **Critical Batch Size** is reached, where adding more data per step yields diminishing returns in convergence.

 A concrete scaling experiment reveals how these ceilings manifest in practice.

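The bandwidth-wall term $2(N-1)/N \times M/B$ from the paragraph above can be evaluated directly. The values below are illustrative assumptions (64 GPUs, the running example's 350 GB FP16 gradient payload, and a 300 GB/s effective link bandwidth), not measured numbers.

```python
# Ring-AllReduce cost model: each GPU moves 2(N-1)/N * M bytes over
# a link of bandwidth B, ignoring the latency (alpha) term.
def allreduce_seconds(n_gpus: int, payload_gb: float, bw_gb_s: float) -> float:
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / bw_gb_s

t = allreduce_seconds(64, 350.0, 300.0)  # assumed N, M, B
print(f"AllReduce time: {t:.2f} s")      # ~2.30 s under these assumptions

# The batch size trap: B_global = N * B_local grows with the fleet.
print(f"Global batch at N=1024, B_local=4: {1024 * 4}")  # 4096
```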
@@ -1352,7 +1362,7 @@ AllReduce complexity depends on two components: latency ($\alpha$) and bandwidth

 Interconnect selection determines whether large-scale deployments remain compute-bound or collapse into communication-bound regimes.

-The bandwidth requirements for efficient distributed training are substantial, particularly for transformer models. Efficient systems require 100--400 GB/s aggregate bandwidth per node for transformer architectures. BERT-Base (110M parameters) requires approximately 440 MB of gradient synchronization per iteration in FP32, while BERT-Large (340M parameters) requires approximately 1.4 GB. Across 64 GPUs, these synchronization demands require 100-200 GB/s sustained bandwidth for sub-50 ms synchronization latency. Language models with `{python} gpt3_params_b`B parameters require 700 GB/s aggregate bandwidth to maintain 80% parallel efficiency, necessitating InfiniBand HDR or equivalent interconnects.
+The bandwidth requirements for efficient distributed training are substantial, particularly for transformer models. Efficient systems require 100--400 GB/s aggregate bandwidth per node for transformer architectures. BERT-Base (110M parameters) requires approximately 440 MB of gradient synchronization per iteration in FP32, while BERT-Large (340M parameters) requires approximately 1.4 GB. Across 64 GPUs, these synchronization demands require 100-200 GB/s sustained bandwidth for sub-50 ms synchronization latency. Language models with `{python} frontier_params_b`B parameters require 700 GB/s aggregate bandwidth to maintain 80% parallel efficiency, necessitating InfiniBand HDR or equivalent interconnects.

 Synchronization frequency presents a trade-off between communication efficiency and convergence behavior. Gradient accumulation reduces synchronization frequency but increases memory requirements and may impact convergence. Synchronizing every 4 steps reduces communication overhead by 60% while increasing memory usage by 3$\times$ for gradient storage. Asynchronous methods eliminate synchronization costs entirely but introduce staleness that degrades convergence by 15--30% for large learning rates.

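The BERT figures quoted above follow from parameter count times FP32 width; a minimal sketch, using the parameter counts stated in the text:

```python
# FP32 gradient volume per iteration: params * 4 bytes.
def grad_mb(params: float, bytes_per_param: int = 4) -> float:
    return params * bytes_per_param / 1e6

bert_base_mb = grad_mb(110e6)    # BERT-Base: ~440 MB
bert_large_mb = grad_mb(340e6)   # BERT-Large: ~1360 MB (~1.4 GB)
print(bert_base_mb, bert_large_mb)
```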
@@ -1784,7 +1794,7 @@ The choice among these methods depends on the specific bottleneck. When network 

 When a model is so massive that even a single layer's weights exceed the memory capacity of a GPU, data parallelism entirely collapses. The memory optimization techniques examined in the previous section extend data parallelism's reach, but eventually, we must partition the model itself.

-Even with ZeRO-3 fully deployed, sharding optimizer states, gradients, and parameters across workers, some architectures remain intractable. A `{python} gpt3_params_b`B parameter model using FSDP across 64 GPUs still requires 700 GB / 64 = 11 GB of parameters per GPU before accounting for activations. For long-context transformers where activation memory dominates, a 2048-token sequence through `{python} gpt3_params_b`B parameters generates 200+ GB of intermediate activations, and no amount of optimizer sharding addresses this constraint. Model parallelism addresses these limitations by splitting the model architecture itself across devices, rather than replicating it with sharded state.
+Even with ZeRO-3 fully deployed, sharding optimizer states, gradients, and parameters across workers, some architectures remain intractable. A `{python} frontier_params_b`B parameter model using FSDP across 64 GPUs still requires 700 GB / 64 = 11 GB of parameters per GPU before accounting for activations. For long-context transformers where activation memory dominates, a 2048-token sequence through `{python} frontier_params_b`B parameters generates 200+ GB of intermediate activations, and no amount of optimizer sharding addresses this constraint. Model parallelism addresses these limitations by splitting the model architecture itself across devices, rather than replicating it with sharded state.

 ```{python}
 #| label: a100-capacity-context
@@ -1817,12 +1827,12 @@ a100_mem = A100CapacityContext.a100_mem
 ```

 ::: {.callout-notebook title="The Memory Wall of Scale"}
-**Problem**: You want to train a **`{python} gpt3_params_b` Billion parameter** model (like GPT-3) on NVIDIA A100s (`{python} a100_mem` GB). Can you use Data Parallelism with ZeRO-3?
+**Problem**: You want to train a **`{python} frontier_params_b` Billion parameter** model (like GPT-3) on NVIDIA A100s (`{python} a100_mem` GB). Can you use Data Parallelism with ZeRO-3?

 **The Math**:

-1. **Parameter Storage**: `{python} gpt3_params_b`B params $\times$ 2 bytes (FP16) = **350 GB**.
-2. **Optimizer State**: `{python} gpt3_params_b`B params $\times$ 12 bytes (Adam FP32) = **2,100 GB**.
+1. **Parameter Storage**: `{python} frontier_params_b`B params $\times$ 2 bytes (FP16) = **350 GB**.
+2. **Optimizer State**: `{python} frontier_params_b`B params $\times$ 12 bytes (Adam FP32) = **2,100 GB**.
 3. **Total Static Memory**: 2,450 GB.
 4. **ZeRO-3 Sharding**: With 64 GPUs, per-GPU static memory = $2,450 / 64 \approx \mathbf{38 \text{ GB}}$.
 5. **Activation Memory**: For sequence length 2048 and batch size 1, a 96-layer transformer generates $\approx \mathbf{50 \text{ GB}}$ of activations per GPU.
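The callout's steps can be replayed as a short script. The 175-billion-parameter count and the 12-bytes-per-parameter Adam figure are assumptions carried over from the running example; the final line also reproduces the earlier FSDP FP32 shard estimate (700 GB / 64 ≈ 11 GB).

```python
# Memory-wall arithmetic for a GPT-3-scale model (assumed: 175 B params).
PARAMS_B = 175                          # billions of parameters (assumed)
fp16_params_gb = PARAMS_B * 2           # step 1: 2 bytes/param  -> 350 GB
adam_state_gb = PARAMS_B * 12           # step 2: Adam FP32 state -> 2,100 GB
static_gb = fp16_params_gb + adam_state_gb   # step 3: 2,450 GB
per_gpu_gb = static_gb / 64             # step 4: ZeRO-3 over 64 GPUs -> ~38 GB
fp32_shard_gb = PARAMS_B * 4 / 64       # FSDP FP32 param shard -> ~11 GB
print(fp16_params_gb, adam_state_gb, static_gb, round(per_gpu_gb), round(fp32_shard_gb))
```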
@@ -1834,7 +1844,7 @@ Model parallelism addresses this limitation...

 Several implementations of model parallelism exist. In layer-based splitting, devices process distinct groups of layers sequentially. The first device might compute layers 1-4 while the second handles layers 5-8. Channel-based splitting divides the channels within each layer across devices, where the first device processes 512 channels while the second manages the remaining ones. For transformer architectures, attention head splitting distributes different attention heads to separate devices.

-This distribution method enables training of large-scale models. GPT-3, with `{python} gpt3_params_b` billion parameters, relies on model parallelism for training. Vision transformers processing high-resolution 16k $\times$ 16k pixel images use model parallelism to manage memory constraints. Mixture-of-Experts architectures use this approach to distribute their conditional computation paths across hardware.
+This distribution method enables training of large-scale models. GPT-3, with `{python} frontier_params_b` billion parameters, relies on model parallelism for training. Vision transformers processing high-resolution 16k $\times$ 16k pixel images use model parallelism to manage memory constraints. Mixture-of-Experts architectures use this approach to distribute their conditional computation paths across hardware.

 Device coordination follows a specific pattern during training. In the forward pass, data flows sequentially through model segments on different devices. The backward pass propagates gradients in reverse order through these segments. During parameter updates, each device modifies only its assigned portion of the model. This coordination ensures mathematical equivalence to training on a single device while enabling the handling of models that exceed individual device memory capacities.

@@ -1896,9 +1906,9 @@ Model parallelism divides neural networks across multiple computing devices, wit
 ```
 :::

-Consider our running example: the `{python} gpt3_params_b`B parameter model requires 350 GB of memory in FP16, exceeding the `{python} a100_mem` GB capacity of a single A100 by a factor of four. Model parallelism addresses this **capacity wall** by partitioning the model's state—parameters, gradients, and optimizer states—across multiple devices, effectively stitching them into a single super-accelerator. Unlike data parallelism, where every GPU holds a full replica of the model and processes a unique fraction of the global batch, model parallelism requires each GPU to hold a unique fraction of the model and process the *same* data stream sequentially. With 8-way partitioning on A100s, each GPU holds approximately 44 GB of parameters—a tight fit that leaves roughly 36 GB for activations and optimizer state.
+Consider our running example: the `{python} frontier_params_b`B parameter model requires 350 GB of memory in FP16, exceeding the `{python} a100_mem` GB capacity of a single A100 by a factor of four. Model parallelism addresses this **capacity wall** by partitioning the model's state—parameters, gradients, and optimizer states—across multiple devices, effectively stitching them into a single super-accelerator. Unlike data parallelism, where every GPU holds a full replica of the model and processes a unique fraction of the global batch, model parallelism requires each GPU to hold a unique fraction of the model and process the *same* data stream sequentially. With 8-way partitioning on A100s, each GPU holds approximately 44 GB of parameters—a tight fit that leaves roughly 36 GB for activations and optimizer state.

-In a typical pipeline parallel implementation, the training loop operates as a relay race. The forward pass initiates on GPU 1, which computes the initial transformer blocks and transmits the resulting intermediate activation tensor across the interconnect to GPU 2. For our `{python} gpt3_params_b`B model with a hidden dimension of 12,288 and a micro-batch size of 4 sequences at 2,048 tokens each, this handoff involves moving approximately 200 MB of data per stage boundary per step. GPU 2 must wait for this payload before it can begin its computation, creating a strict dependency chain that propagates through all stages. The backward pass mirrors this path in reverse, propagating error gradients from the final layer back to the input, with each device computing gradients only for its local parameters.
+In a typical pipeline parallel implementation, the training loop operates as a relay race. The forward pass initiates on GPU 1, which computes the initial transformer blocks and transmits the resulting intermediate activation tensor across the interconnect to GPU 2. For our `{python} frontier_params_b`B model with a hidden dimension of 12,288 and a micro-batch size of 4 sequences at 2,048 tokens each, this handoff involves moving approximately 200 MB of data per stage boundary per step. GPU 2 must wait for this payload before it can begin its computation, creating a strict dependency chain that propagates through all stages. The backward pass mirrors this path in reverse, propagating error gradients from the final layer back to the input, with each device computing gradients only for its local parameters.

 This architecture fundamentally changes the optimization dynamics compared to data parallelism. Instead of a global AllReduce to average gradients across replicas, each GPU performs a local optimizer step (Adam [@kingma2015adam], AdaFactor, or similar) on its specific slice of parameters. A device holding transformer layers 1–12 updates only those layers' weights and biases, with no cross-device synchronization required during the optimization step. While this eliminates the bandwidth-heavy gradient synchronization of data parallelism, it trades one bottleneck for another: **pipeline bubbles**. If the layers assigned to GPU 1 are computationally heavier than those on GPU 2—common when attention layers have different head counts or when embedding layers are unevenly sized—valuable compute cycles are lost to waiting. The primary engineering challenge thus shifts from maximizing arithmetic intensity to minimizing serialization latency and ensuring balanced load across the partitioned fleet [@deepspeed_training_system_2021].

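The ~200 MB stage-boundary handoff quoted above is the size of one FP16 activation tensor; a minimal sketch using the dimensions stated in the text:

```python
# Activation tensor handed between pipeline stages:
# micro-batch * seq_len * hidden_dim * bytes-per-element (FP16).
micro_batch, seq_len, hidden, fp16_bytes = 4, 2048, 12288, 2
handoff_mb = micro_batch * seq_len * hidden * fp16_bytes / 1e6
print(f"{handoff_mb:.0f} MB per stage boundary")  # ~201 MB
```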
@@ -2311,7 +2321,7 @@ This distinction dictates fundamentally different cluster designs: dense GPU pod

 Model parallelism breaks the memory wall but introduces sequential dependencies that reduce hardware utilization. The engineering challenge is balancing **pipeline bubbles** (idle time) against **all-to-all bandwidth** (communication time).

-Model parallelism offers three principal advantages. Memory scaling enables training of models that exceed single-device capacity: with 8-way tensor parallelism, a `{python} gpt3_params_b`B model fits comfortably on A100s. Splitting the model also allows larger global batch sizes without out-of-memory errors, since each GPU processes a smaller parameter slice. The approach maps naturally to the physical structure of transformers, where attention heads split via tensor parallelism and layers split via pipeline parallelism.
+Model parallelism offers three principal advantages. Memory scaling enables training of models that exceed single-device capacity: with 8-way tensor parallelism, a `{python} frontier_params_b`B model fits comfortably on A100s. Splitting the model also allows larger global batch sizes without out-of-memory errors, since each GPU processes a smaller parameter slice. The approach maps naturally to the physical structure of transformers, where attention heads split via tensor parallelism and layers split via pipeline parallelism.

 These advantages come at the cost of three fundamental limitations. Pipeline bubbles cause GPUs to sit idle while filling and draining the pipeline; the bubble fraction is $(P-1)/M$, where $P$ is pipeline stages and $M$ is microbatches, and achieving more than 90% efficiency requires $M \gg P$, which increases activation memory. Communication intensity in tensor parallelism is equally constraining: 2 AllReduce operations execute *per layer* on the critical path, demanding extremely high-bandwidth, low-latency interconnects (NVLink) and typically preventing scaling beyond a single node (8 GPUs) before hitting the bandwidth wall. Implementation complexity rounds out the trade-off, requiring invasive changes to the model definition — replacing standard linear layers with column-parallel and row-parallel variants — unlike data parallelism, which wraps the model externally without modifying internals.

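The bubble fraction $(P-1)/M$ above is easy to evaluate; a quick sketch (the 8-stage / 32-microbatch configuration is an illustrative assumption):

```python
# Pipeline bubble fraction: (P - 1) / M for P stages and M microbatches.
def bubble_fraction(p_stages: int, m_microbatches: int) -> float:
    return (p_stages - 1) / m_microbatches

print(f"{bubble_fraction(8, 32):.1%}")  # 21.9% of cycles idle
# >90% efficiency needs (P-1)/M < 0.1, so for P=8 the smallest M is:
print(min(m for m in range(1, 1000) if bubble_fraction(8, m) < 0.1))  # 71
```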
@@ -2347,7 +2357,7 @@ This three-precision approach (FP32 master weights, FP8 GEMMs, FP16 accumulation

 Training a frontier model when data parallelism runs out of memory and model parallelism runs out of network bandwidth requires orchestrating both strategies simultaneously across three dimensions. The preceding sections revealed a fundamental tension: data parallelism scales throughput but demands massive memory, while model parallelism enables large models but starves the compute.

-Hybrid parallelism resolves this tension by applying both strategies orthogonally: model parallelism splits the architecture to fit available memory, while data parallelism scales throughput across multiple model replicas. Training a `{python} gpt3_params_b` billion parameter language model on a dataset of 300 billion tokens demonstrates this approach in practice. The neural network layers distribute across multiple GPUs through model parallelism, while data parallelism enables different GPU groups to process separate batches. This dual strategy addresses both memory constraints from model size and computational demands from dataset scale simultaneously, and it is precisely this combination that defines *Archetype A* training at frontier scale.
+Hybrid parallelism resolves this tension by applying both strategies orthogonally: model parallelism splits the architecture to fit available memory, while data parallelism scales throughput across multiple model replicas. Training a `{python} frontier_params_b` billion parameter language model on a dataset of 300 billion tokens demonstrates this approach in practice. The neural network layers distribute across multiple GPUs through model parallelism, while data parallelism enables different GPU groups to process separate batches. This dual strategy addresses both memory constraints from model size and computational demands from dataset scale simultaneously, and it is precisely this combination that defines *Archetype A* training at frontier scale.

 ::: {.callout-lighthouse title="Archetype A (GPT-4 / Llama-3): Physics of 3D Parallelism"}
 **Archetype A (GPT-4 / Llama-3)** is the primary driver for hybrid parallelism. Because the model parameters ($P$) exceed the memory of any single accelerator ($M_{device}$), and the training dataset ($D$) requires massive throughput, we must split the problem along three orthogonal axes:
@@ -2361,7 +2371,7 @@ Only by combining all three can we train Archetype A systems efficiently.

 ### The 3D Training Loop {#sec-distributed-training-systems-systems-hybrid-parallelism-3d-loop}

-Training a `{python} gpt3_params_b`B parameter model requires orchestrating computation across thousands of devices through **3D Parallelism**[^fn-3d-parallelism]. This approach does not merely sum the benefits of individual parallelism strategies; it composes them geometrically to match the physical topology of the hardware.
+Training a `{python} frontier_params_b`B parameter model requires orchestrating computation across thousands of devices through **3D Parallelism**[^fn-3d-parallelism]. This approach does not merely sum the benefits of individual parallelism strategies; it composes them geometrically to match the physical topology of the hardware.

 [^fn-3d-parallelism]: **3D Parallelism**: Named after the three orthogonal axes of decomposition: (1) Data Parallelism (batch), (2) Tensor Parallelism (layer width), and (3) Pipeline Parallelism (depth). Organizations visualize their training fleets as a 3D grid $(d, t, p)$, where the product $d \times t \times p$ equals the total GPU count. This geometric perspective is essential for balancing the tiered bandwidth constraints of modern clusters. \index{3D Parallelism!etymology}
 Consider a training fleet configured with Tensor Parallelism (TP) of 8, Pipeline Parallelism (PP) of 16, and Data Parallelism (DP) of 128. This configuration uses 16,384 GPUs ($8 \times 16 \times 128$) organized into a hierarchy of bandwidth domains.
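The fleet configuration above is just the product of the three axes; a minimal sketch of the $(d, t, p)$ grid bookkeeping:

```python
# 3D parallelism grid: total GPU count is the product d * t * p.
tp, pp, dp = 8, 16, 128          # tensor, pipeline, data degrees (from the text)
total_gpus = dp * tp * pp
print(f"Total GPUs: {total_gpus}")  # 16384
# TP typically stays inside the NVLink domain (one 8-GPU node),
# PP spans nodes point-to-point, and DP replicates the whole
# pipeline across the cluster fabric.
```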
@@ -2710,7 +2720,7 @@ Engineers assume more GPUs always accelerate training. In production, statistica

 Pitfall: ***Choosing parallelism strategy based solely on memory constraints.***

-Engineers see that a 70B model exceeds `{python} a100_mem` GB GPU memory and immediately choose tensor parallelism or pipeline parallelism to split weights. In production, the optimal strategy depends on the interaction between memory pressure, computation patterns, and communication topology. As @sec-distributed-training-systems-systems-parallelism-strategy-comparison-d92a explains, tensor parallelism splits each layer across devices with AllReduce synchronization per layer, achieving even memory distribution but placing communication on the critical path. Pipeline parallelism assigns complete layers to stages with point-to-point transfers between stages, reducing per-step communication but introducing pipeline bubble overhead that wastes 10-30% of cycles. For a `{python} gpt3_params_b`B model on 64 A100 GPUs where tensor parallelism degree-8 enables training, pipeline parallelism with 8 stages achieves 23% higher throughput due to reduced all-to-all communication despite similar memory footprints. The decision requires profiling communication patterns and bubble overhead, not just checking if weights fit in memory.
+Engineers see that a 70B model exceeds `{python} a100_mem` GB GPU memory and immediately choose tensor parallelism or pipeline parallelism to split weights. In production, the optimal strategy depends on the interaction between memory pressure, computation patterns, and communication topology. As @sec-distributed-training-systems-systems-parallelism-strategy-comparison-d92a explains, tensor parallelism splits each layer across devices with AllReduce synchronization per layer, achieving even memory distribution but placing communication on the critical path. Pipeline parallelism assigns complete layers to stages with point-to-point transfers between stages, reducing per-step communication but introducing pipeline bubble overhead that wastes 10-30% of cycles. For a `{python} frontier_params_b`B model on 64 A100 GPUs where tensor parallelism degree-8 enables training, pipeline parallelism with 8 stages achieves 23% higher throughput due to reduced all-to-all communication despite similar memory footprints. The decision requires profiling communication patterns and bubble overhead, not just checking if weights fit in memory.

 Fallacy: ***FSDP and ZeRO always improve training efficiency.***
