[PR #1530] [MERGED] fix(mlsysim): hierarchical AllReduce broadcast 8x overestimate + H200 capacity unit mismatch #8241
📋 Pull Request Information
Original PR: https://github.com/harvard-edge/cs249r_book/pull/1530
Author: @Shashank-Tripathi-07
Created: 4/25/2026
Status: ✅ Merged
Merged: 4/25/2026
Merged by: @profvjreddi
Base: dev ← Head: fix/mlsysim-allreduce-and-h200-capacity
📝 Commits (1)
13ea293 fix(mlsysim): allreduce broadcast uses full M instead of M/G, H200 capacity in wrong unit
📊 Changes
2 files changed (+3 additions, -2 deletions)
📝 mlsysim/mlsysim/core/constants.py (+1 -1)
📝 mlsysim/mlsysim/core/formulas.py (+2 -1)
📄 Description
Summary
Two correctness bugs in mlsysim/core, both producing silent wrong numbers.

Bug 1 -- Hierarchical AllReduce broadcast phase 8x too slow (formulas.py)

calc_hierarchical_allreduce_time() models NCCL's 3-phase algorithm:

- Phase 1, intra-node reduce: M bytes → each GPU holds M/G
- Phase 2, inter-node exchange: M/G bytes
- Phase 3, intra-node broadcast: M/G bytes ← bug here

Phase 3 was reusing t_reduce (computed on the full M) behind a "# Symmetry assumption" comment. But the broadcast only carries M/G bytes -- the reduced shard, not the full gradient. reduced_message was already computed for phase 2 and just needed to be passed to phase 3.

Impact: on an 8-GPU node, broadcast time was overestimated by 8x. For a 16B-parameter model (~32 GB of gradients at FP16) on 2×8 H100s, this overcounts by ~220 µs per AllReduce step, so every multi-node DP training simulation reports pessimistically slow communication.
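A minimal sketch of the corrected cost model, reconstructed from the description above. calc_hierarchical_allreduce_time() and reduced_message are the names the PR mentions; the parameter list, bandwidth arguments, and ring-style cost terms are assumptions, not the repo's actual code.

```python
def calc_hierarchical_allreduce_time(
    message_bytes: float,  # M: full gradient size in bytes
    gpus_per_node: int,    # G
    num_nodes: int,
    intra_bw: float,       # intra-node bandwidth, bytes/s (e.g. NVLink)
    inter_bw: float,       # inter-node bandwidth, bytes/s (e.g. InfiniBand)
) -> float:
    """Hypothetical 3-phase hierarchical AllReduce cost model."""
    G = gpus_per_node
    reduced_message = message_bytes / G  # M/G: the per-GPU reduced shard

    # Phase 1: intra-node reduce over the full M bytes.
    t_reduce = message_bytes * (G - 1) / G / intra_bw

    # Phase 2: inter-node ring all-reduce over the M/G shard.
    t_inter = 2 * reduced_message * (num_nodes - 1) / num_nodes / inter_bw

    # Phase 3: intra-node broadcast of the reduced shard. The bug reused
    # t_reduce here ("# Symmetry assumption"), charging the full M bytes;
    # the broadcast only moves reduced_message = M/G bytes.
    t_broadcast = reduced_message * (G - 1) / G / intra_bw

    return t_reduce + t_inter + t_broadcast
```

Under this model the buggy phase 3 equals t_reduce, exactly G times the corrected t_broadcast -- hence the 8x overestimate on an 8-GPU node.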
Bug 2 -- H200 memory capacity in decimal GB, everything else in binary GiB (constants.py)

Every other GPU (V100, A100, H100, B200) uses GiB. The outlier caused ~7% overcounting of H200 memory capacity in cross-GPU comparisons and memory-fit calculations.
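The ~7% figure is just the GiB/GB unit ratio. A minimal sketch of the mismatch, assuming the constant carried the decimal-GB marketing figure (141) where a GiB value was expected; the identifiers below are illustrative, not the repo's actual names:

```python
GIB = 2**30  # binary gibibyte, the unit every other GPU entry uses
GB = 10**9   # decimal gigabyte

# Buggy: the decimal-GB marketing figure written where GiB was expected.
H200_MEMORY_GIB_BUGGY = 141
# Fixed: the same 141 GB of HBM expressed in GiB (~131.3).
H200_MEMORY_GIB_FIXED = 141 * GB / GIB

# The overcount is exactly the unit ratio 2**30 / 10**9 ≈ 1.074, i.e. ~7.4%.
print(H200_MEMORY_GIB_BUGGY / H200_MEMORY_GIB_FIXED)  # 1.073741824
```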
Test plan

uv run --with pytest python -m pytest tests/ -- 363 passed, 2 skipped, 0 failures