[BUG] Cannot use --hostfile to start multi-node training in Docker. #6875

Open
Ind1x1 opened this issue Dec 16, 2024 · 2 comments
Ind1x1 commented Dec 16, 2024

Describe the bug
I used an overlay network to connect containers on two hosts, and configured passwordless SSH along with the relevant `/etc/hosts` entries and hostfile. However, I was unable to start training with `deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py`. After checking deepspeed.ai, I found that I can start training using the "Launching without passwordless SSH" method with `deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=10.0.1.13 test.py`. I would like to know what is causing this issue.
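For reference, the "Launching without passwordless SSH" method has to be run once on each node. A minimal sketch of what that looks like for these two containers (the `--master_port` value is only an illustrative choice, and the worker command mirrors the documented per-node usage):

```bash
# on the manager container (10.0.1.13), node rank 0
deepspeed --hostfile=hostfile --no_ssh --node_rank=0 \
    --master_addr=10.0.1.13 --master_port=29500 test.py

# on the worker container (10.0.1.15), node rank 1
deepspeed --hostfile=hostfile --no_ssh --node_rank=1 \
    --master_addr=10.0.1.13 --master_port=29500 test.py
```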

These are the logs from my training and some of my configuration.
```
root@903c1e9c351c:/home/user/code# deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py
[2024-12-16 07:28:11,223] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 470, in main
subprocess.check_call(safe_ssh_cmd, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-o', 'PasswordAuthentication=no', 'manager', 'hostname']' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/deepspeed", line 6, in
main()
File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 472, in main
raise RuntimeError(
RuntimeError: Using hostfile at hostfile but host=manager was not reachable via ssh. If you are running with a single node please remove hostfile or setup passwordless ssh.
```

```
root@903c1e9c351c:/home/user/code# cat /etc/hosts
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.0.1.13       903c1e9c351c
10.0.1.13       manager
10.0.1.15       worker

root@903c1e9c351c:/home/user/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.0.1.13  netmask 255.255.255.0  broadcast 10.0.1.255
        ether 02:42:0a:00:01:0d  txqueuelen 0  (Ethernet)
        RX packets 512  bytes 78880 (78.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 471  bytes 79480 (79.4 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```

hostfile:

```
manager slots=1
worker slots=1
```

Ind1x1 added the `bug` and `training` labels on Dec 16, 2024
GuanhuaWang (Member) commented Dec 18, 2024

Hi @Ind1x1

Thanks for raising this question. I would suggest still using the DeepSpeed launcher with SSH enabled. To make it work across containers on different hosts, you should do the following (assuming 2 containers located on 2 separate hosts):

1. Make sure you strictly follow the Docker overlay network setup here. One final thing to check: say you have two nodes, where the first (host1) hosts the swarm and the second (host2) joins host1's swarm network. After you launch the container on host2, the `test-net` from the tutorial should show up when you run `docker network ls` on host2.
2. After setting up the public keys across the two containers (i.e. `authorized_keys`), try SSHing from container1 to container2 (and vice versa) to see if it works. If this part is not working, it is highly likely that:
   2.1 the access permissions (i.e. `chmod`) on the `.ssh` folder or `authorized_keys` are wrong, or
   2.2 `/etc/ssh/sshd_config` was not set up correctly; it should contain

   Port 22
   PermitRootLogin yes
   PubkeyAuthentication yes

   and then run `service ssh restart` in both containers.
3. Once the above two steps work correctly, you should be able to use the DeepSpeed launcher with SSH enabled. A rough sketch of the whole setup follows below.
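For completeness, here is a rough sketch of the steps above in shell form. The network name `test-net` follows the tutorial; the image name, swarm token, and host IP are placeholders you would substitute, and the `manager`/`worker` hostnames match the hostfile in this issue:

```bash
# On host1: initialize the swarm and create an attachable overlay network
docker swarm init
docker network create --driver overlay --attachable test-net

# On host2: join host1's swarm (token and IP come from the `docker swarm init` output)
docker swarm join --token <worker-token> <host1-ip>:2377

# On each host: start a container attached to the overlay network
docker run -d --gpus all --network test-net --name node1 <image> sleep infinity   # host1
docker run -d --gpus all --network test-net --name node2 <image> sleep infinity   # host2

# Inside each container: create a key pair, exchange public keys, fix permissions
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# append the other container's id_rsa.pub to ~/.ssh/authorized_keys, then:
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
service ssh restart

# Verify both directions before using the DeepSpeed launcher
ssh worker hostname    # from the manager container
ssh manager hostname   # from the worker container
```

If `docker network ls` on host2 does not list `test-net` after the container starts, the overlay network (step 1) is the problem rather than SSH.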

Ind1x1 (Author) commented Dec 29, 2024

Thank you very much for your guidance @GuanhuaWang. However, I am now facing a new issue. When training on two machines, each with 8 A100 GPUs, I encounter the following error. When I reduce the number of GPUs to 4 per machine, training works fine, and in that 4-GPU-per-machine configuration I confirmed that both network cards are working properly.

```
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=12 opcode=0 len=0 vendor err 129 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<47140> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<47140> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<47140> with status=5 opcode=0 len=0 vendor err 244 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<47140> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<47140> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 244 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]

50428e4af543:34610:36821 [4] ib_plugin.c:1673 NCCL WARN NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
50428e4af543:34610:36821 [4] NCCL INFO transport/net.cc:1314 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:638 -> 6
50428e4af543:34610:36821 [4] NCCL INFO proxy.cc:819 -> 6 [Progress Thread]
[rank4]:[E1229 16:47:25.333820755 ProcessGroupNCCL.cpp:541] [Rank 4] found async exception when checking for NCCL errors: NCCL error: remote process exited or there was a network error, NCCL version 2.22.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
Exception raised from checkForNCCLErrorsInternal at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1954 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fad0a5e2cc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10b0a0e (0x7fad0b6ffa0e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x262 (0x7fad0b715742 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fad0b7158bb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fad0b71f773 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fad0b7215cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fad5c2b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fad68124ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126850 (0x7fad681b6850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E1229 16:47:25.341226199 ProcessGroupNCCL.cpp:1722] [PG 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 370, last completed NCCL work: -1.
[rank4]:[E1229 16:47:25.341253866 ProcessGroupNCCL.cpp:629] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E1229 16:47:25.341267632 ProcessGroupNCCL.cpp:635] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E1229 16:47:25.341337557 ProcessGroupNCCL.cpp:1571] [PG 1 Rank 4] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.22.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
Exception raised from checkForNCCLErrorsInternal at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1954 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fad0a5e2cc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10b0a0e (0x7fad0b6ffa0e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x262 (0x7fad0b715742 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fad0b7158bb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fad0b71f773 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fad0b7215cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fad5c2b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fad68124ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126850 (0x7fad681b6850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 1 Rank 4] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.22.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB: Got completion from peer 10.0.1.23<34696> with status=5 opcode=0 len=0 vendor err 249 (Recv) hca ibp152s0
Exception raised from checkForNCCLErrorsInternal at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1954 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fad0a5e2cc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10b0a0e (0x7fad0b6ffa0e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x262 (0x7fad0b715742 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fad0b7158bb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fad0b71f773 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fad0b7215cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fad5c2b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fad68124ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126850 (0x7fad681b6850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1577 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fad0a5e2cc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10b0a0e (0x7fad0b6ffa0e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd52020 (0x7fad0b3a1020 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7fad5c2b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7fad68124ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x126850 (0x7fad681b6850 in /lib/x86_64-linux-gnu/libc.so.6)
```
