cs249r_book/mlsysim/docs/api/core.solver.DistributedSolver.qmd
Vijay Janapa Reddi 2bbe3e1a69 docs(mlsysim): redesign website, add 12 tutorials, and CLI entry points
Replace 9 old tutorials with 12 new numbered tutorials (00-11) covering
roofline through full-stack audit. Redesign landing page, add
models-and-solvers and extending-the-engine guides. Add __main__.py,
cli.py, and cli/ package for command-line interface.
2026-03-12 16:04:51 -04:00


# core.solver.DistributedModel { #mlsysim.core.solver.DistributedModel }
```python
core.solver.DistributedModel()
```
Resolves fleet-wide communication, synchronization, and pipelining constraints.
This solver models the constraints that arise when training is scaled across a
cluster. It decomposes a workload using 3D Parallelism (data, tensor, and
pipeline parallelism: DP, TP, PP) and calculates the resulting communication
overheads and idle times (pipeline bubbles) that determine the Model FLOPs
Utilization (MFU).
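
As a rough illustration of the decomposition, the sketch below derives the data-parallel degree left over once tensor and pipeline parallelism have claimed their share of devices. The function name `data_parallel_degree` and the `world_size` argument are hypothetical and are not part of `mlsysim`; whether the solver computes DP in exactly this way is an assumption.

```python
def data_parallel_degree(world_size: int, tp_size: int = 1, pp_size: int = 1) -> int:
    """Return the DP degree implied by a device count and TP/PP degrees.

    Hypothetical helper for illustration only; not an mlsysim API.
    """
    if world_size < 1 or tp_size < 1 or pp_size < 1:
        raise ValueError("world_size, tp_size, and pp_size must all be >= 1")
    model_parallel = tp_size * pp_size  # devices holding one model replica
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    return world_size // model_parallel


# 64 devices, TP=8 inside each node, PP=2 across nodes -> 4 model replicas.
print(data_parallel_degree(world_size=64, tp_size=8, pp_size=2))  # 4
```

Each of the resulting replicas processes a different slice of the global batch, which is why gradient synchronization cost grows with the DP degree.
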
Literature Sources:

1. Shoeybi et al. (2019), "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." (3D Parallelism Framework)
2. Narayanan et al. (2019), "PipeDream: Generalized Pipeline Parallelism for DNN Training." (1F1B Pipeline Bubble Model)
3. Patarasuk & Yuan (2009), "Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations." (Ring All-Reduce)
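
The ring all-reduce cited in the last reference has a well-known cost model: with N devices, the message is split into N chunks and moved in 2(N-1) steps (a reduce-scatter followed by an all-gather). A minimal sketch of that textbook formula follows; the function name and arguments are hypothetical, and the solver's own implementation may add terms or differ in detail.

```python
def ring_allreduce_seconds(
    message_bytes: float,
    n_devices: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Estimate ring all-reduce time with the standard alpha-beta cost model.

    Hypothetical helper for illustration only; not an mlsysim API.
    """
    if n_devices < 1:
        raise ValueError("n_devices must be >= 1")
    if n_devices == 1:
        return 0.0  # nothing to synchronize
    steps = 2 * (n_devices - 1)          # reduce-scatter + all-gather
    chunk = message_bytes / n_devices    # bytes sent per step per device
    return steps * (latency_s + chunk / bandwidth_bytes_per_s)


# 1 GB of gradients across 8 devices on 100 GB/s links, ignoring latency.
t = ring_allreduce_seconds(1e9, 8, 100e9)
print(f"{t * 1e3:.1f} ms")  # 17.5 ms
```

The bandwidth term approaches 2S/B as N grows, so the per-device traffic is nearly independent of cluster size, while the latency term grows linearly with N.
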
## Methods
| Name | Description |
| --- | --- |
| [solve](#mlsysim.core.solver.DistributedModel.solve) | Calculates distributed training performance using the 3D/4D Parallelism model. |
### solve { #mlsysim.core.solver.DistributedModel.solve }
```python
core.solver.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
)
```
Calculates distributed training performance using the 3D/4D Parallelism model.
#### Parameters {.doc-section .doc-section-parameters}
| Name | Type | Description | Default |
|-------------------|----------|------------------------------------------------------------------------------------------------------------------------------|------------|
| model | Workload | The model architecture to simulate. | _required_ |
| fleet | Fleet | The hardware cluster and network topology. | _required_ |
| batch_size | int | Global batch size. | `1` |
| precision | str | Numerical precision (fp16, fp32, int8). | `'fp16'` |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | `0.5` |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | `1` |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing 'pipeline bubbles' while saving memory. | `1` |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | `1` |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | `1` |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | `1` |
| topology_override | str | Force a specific topology (ring, tree). | `None` |
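
To make the interaction between `pp_size`, `microbatch_count`, and `v_stages` concrete, the sketch below evaluates the commonly used bubble model for 1F1B-style schedules: the idle time is proportional to (pp_size - 1) / v_stages, and the useful time is proportional to the number of microbatches. The function `pipeline_bubble_fraction` is hypothetical, and it is an assumption that the solver's bubble penalty follows this exact expression.

```python
def pipeline_bubble_fraction(
    pp_size: int, microbatch_count: int, v_stages: int = 1
) -> float:
    """Fraction of a training step spent idle in the pipeline bubble.

    Hypothetical helper for illustration only; not an mlsysim API.
    """
    if pp_size < 1 or microbatch_count < 1 or v_stages < 1:
        raise ValueError("pp_size, microbatch_count, and v_stages must be >= 1")
    # Idle slots per device, measured in units of one microbatch's fwd+bwd time.
    idle = (pp_size - 1) / v_stages
    # Total step time = useful slots (one per microbatch) + idle slots.
    return idle / (microbatch_count + idle)


for m in (1, 8, 32):
    print(m, round(pipeline_bubble_fraction(pp_size=4, microbatch_count=m), 3))
```

With the default `microbatch_count=1` and four pipeline stages, three quarters of the step is idle under this model, so a realistic configuration usually sets the microbatch count well above `pp_size`.
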
#### Returns {.doc-section .doc-section-returns}
| Name | Type | Description |
|--------|------------------|-----------------------------------------------------------------------------------------------------|
| | Dict\[str, Any\] | Metrics including DP/TP/EP latency, the Pipeline Bubble penalty, and the final Scaling Efficiency. |