Hi,
I'm trying to benchmark multi-node all-gather performance using the param tests with buffers up to 2 GB, but the test OOMs at a buffer size of around 1 GB, while the same configuration works with nccl-tests. The all-reduce (AR) and reduce-scatter (RS) tests are fine, and their results are very similar to nccl-tests. You can reproduce this on A100-40G or H100 clusters (p4d or p5 on AWS). Any ideas or insight would be helpful. Thank you!

Environment: PyTorch nightly with CUDA 12.1, or PyTorch 2.0.1 with CUDA 11.8.
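One back-of-the-envelope calculation that might be relevant (my assumption, not verified against the param source): an all-gather's output tensor is world_size times the per-rank input, so the per-GPU footprint grows with rank count, while all-reduce and reduce-scatter buffers stay at or below the input size. A hypothetical sizing sketch (the 16-rank / 1 GiB numbers are illustrative, not measured):

```python
# Rough all-gather memory footprint per GPU: the input buffer plus the
# gathered output buffer, which is world_size times the input.
# (Assumption about why all-gather OOMs earlier than AR/RS; illustrative only.)

GIB = 1024 ** 3

def allgather_bytes_per_gpu(input_bytes: int, world_size: int) -> int:
    """Input buffer plus the full gathered output resident on each GPU."""
    return input_bytes + world_size * input_bytes

# Hypothetical 16-rank job (e.g. 2 nodes x 8 GPUs) with a 1 GiB per-rank buffer:
per_gpu = allgather_bytes_per_gpu(1 * GIB, world_size=16)
print(per_gpu / GIB)  # 17.0 GiB per GPU just for the collective's buffers
```

On a 40 GB A100 that alone would put a ~1 GiB per-rank buffer close to the limit once model state and allocator overhead are included, which could explain why the OOM appears near 1 GB only for all-gather.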
For param, I'm launching the following way:
For nccl-tests, I'm using NCCL 2.18.3 + CUDA 12.1, but older versions also work.
And in the bash file: