[PR #1530] [MERGED] fix(mlsysim): hierarchical AllReduce broadcast 8x overestimate + H200 capacity unit mismatch #9196

Closed
opened 2026-05-03 01:26:05 -05:00 by GiteaMirror (Owner) · 0 comments

📋 Pull Request Information

Original PR: https://github.com/harvard-edge/cs249r_book/pull/1530
Author: @Shashank-Tripathi-07
Created: 4/25/2026
Status: Merged
Merged: 4/25/2026
Merged by: @profvjreddi

Base: dev ← Head: fix/mlsysim-allreduce-and-h200-capacity


📝 Commits (1)

  • 13ea293 fix(mlsysim): allreduce broadcast uses full M instead of M/G, H200 capacity in wrong unit

📊 Changes

2 files changed (+3 additions, -2 deletions)


📝 mlsysim/mlsysim/core/constants.py (+1 -1)
📝 mlsysim/mlsysim/core/formulas.py (+2 -1)

📄 Description

Summary

Two correctness bugs in mlsysim/core, both producing silent wrong numbers:

Bug 1 -- Hierarchical AllReduce broadcast phase 8x too slow (formulas.py)

`calc_hierarchical_allreduce_time()` models NCCL's 3-phase algorithm:

  1. Intra-node reduce-scatter: M bytes → each GPU holds M/G
  2. Inter-node AllReduce across nodes: M/G bytes
  3. Intra-node broadcast back: M/G bytes ← bug here

Phase 3 reused `t_reduce` (computed on the full `M`), justified only by a `# Symmetry assumption` comment. But the broadcast carries only `M/G` bytes -- the reduced shard, not the full gradient. `reduced_message` was already computed for phase 2 and just needed to be passed to phase 3.

Impact: On an 8-GPU node, the broadcast time was overestimated by 8x. For a 16B-param model (~32 GB of gradients at FP16) on 2×8 H100s, this overcounts by ~220 µs per AllReduce step -- every multi-node DP training simulation reports pessimistically slow communication.
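
A quick sanity check of the quoted sizes (pure arithmetic; the absolute ~220 µs figure additionally depends on link-speed assumptions not restated here):

```python
# Back-of-envelope check of the quoted gradient size (pure arithmetic).
params = 16e9                  # 16B parameters
grad_bytes = params * 2        # FP16 -> 2 bytes per parameter
print(grad_bytes / 1e9)        # 32.0 -> the "~32 GB gradients" figure

# The 8x factor is just G: the buggy phase-3 term was fed M instead of
# M/G, so its bandwidth cost is G times too large on a G=8 GPU node.
```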

```python
# Before (wrong):
t_broadcast = t_reduce  # Symmetry assumption

# After (correct):
t_broadcast = calc_ring_allreduce_time(reduced_message, gpus_per_node, intra_node_bw, intra_node_lat)
```
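
For context, a minimal sketch of how the corrected three-phase calculation plausibly fits together. Only `calc_hierarchical_allreduce_time`, `calc_ring_allreduce_time`, `reduced_message`, `gpus_per_node`, `intra_node_bw`, and `intra_node_lat` appear in the diff; the remaining names and the α-β ring cost inside `calc_ring_allreduce_time` are illustrative assumptions, not the repo's actual code:

```python
# Minimal sketch of the corrected 3-phase timing (assumed wiring; the
# parameter names beyond those quoted in the diff are illustrative).

def calc_ring_allreduce_time(message_bytes, n, bw, lat):
    # Generic alpha-beta ring AllReduce cost: 2*(n-1) steps, each
    # moving message_bytes/n bytes at bandwidth bw with latency lat.
    if n <= 1:
        return 0.0
    return 2 * (n - 1) * (lat + (message_bytes / n) / bw)

def calc_hierarchical_allreduce_time(message_size, gpus_per_node, num_nodes,
                                     intra_node_bw, intra_node_lat,
                                     inter_node_bw, inter_node_lat):
    # Phase 1: intra-node reduce-scatter over the full gradient M.
    t_reduce = calc_ring_allreduce_time(
        message_size, gpus_per_node, intra_node_bw, intra_node_lat)
    # Each GPU now holds a 1/G shard of the reduced gradient.
    reduced_message = message_size / gpus_per_node
    # Phase 2: inter-node AllReduce over the M/G shard.
    t_inter = calc_ring_allreduce_time(
        reduced_message, num_nodes, inter_node_bw, inter_node_lat)
    # Phase 3 (the fix): the broadcast moves only the M/G shard, not full M.
    t_broadcast = calc_ring_allreduce_time(
        reduced_message, gpus_per_node, intra_node_bw, intra_node_lat)
    return t_reduce + t_inter + t_broadcast
```

Modeling the broadcast with the same ring-cost helper as the other two phases keeps all three terms on a consistent latency-bandwidth footing.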

Bug 2 -- H200 memory capacity in decimal GB, everything else in binary GiB (constants.py)

```python
# Before:
H200_MEM_CAPACITY = 141 * GB   # decimal, 141×10⁹ bytes

# After:
H200_MEM_CAPACITY = 131 * GiB  # 141 GB converted to GiB, matches all other GPU constants
```

Every other GPU (V100, A100, H100, B200) uses GiB. The outlier caused ~7% overcounting of H200 memory capacity in cross-GPU comparisons and memory-fit calculations.
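
The ~7% figure falls straight out of the GiB/GB ratio; a quick check (the `GB`/`GiB` definitions below are assumed to match `constants.py`):

```python
# Why the mismatch is ~7%: the GiB/GB ratio (constant definitions
# assumed to match constants.py's intent).
GB = 10**9                     # decimal gigabyte
GiB = 2**30                    # binary gibibyte

print(GiB / GB)                # 1.0737... -> the ~7% unit gap
print(141 * GB / GiB)          # 131.3... -> why 141 GB becomes 131 * GiB
print(131 * GiB / 1e9)         # 140.66 -> within ~0.25% of 141e9 bytes
```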

Test plan

  • uv run --with pytest python -m pytest tests/ -- 363 passed, 2 skipped, 0 failures

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-03 01:26:05 -05:00

Reference: github-starred/cs249r_book#9196