[BUG] Cannot use --hostfile to start multi-node training in Docker. #6875
Comments
Hi @Ind1x1, thanks for raising this question. I would still suggest using the deepspeed launcher with SSH enabled. To make it work across containers on different hosts, you should do the following (assuming 2 containers located on 2 separate hosts):
and then
Thank you very much for your guidance @GuanhuaWang
Describe the bug
I used an overlay network to connect containers on two hosts, and configured passwordless SSH along with the relevant /etc/hosts entries and hostfile. However, I was unable to start training with this command:
deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py
After checking deepspeed.ai, I found that I can start training with the "Launching without passwordless SSH" method, using this command:
deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=10.0.1.13 test.py
I would like to know what is causing this issue. Here are the log from the failed launch and some of my configuration.
`root@903c1e9c351c:/home/user/code# deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py
[2024-12-16 07:28:11,223] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 470, in main
    subprocess.check_call(safe_ssh_cmd, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-o', 'PasswordAuthentication=no', 'manager', 'hostname']' returned non-zero exit status 255.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/bin/deepspeed", line 6, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 472, in main
    raise RuntimeError(
RuntimeError: Using hostfile at hostfile but host=manager was not reachable via ssh. If you are running with a single node please remove hostfile or setup passwordless ssh.`
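The traceback shows the launcher probing the host with `ssh -o PasswordAuthentication=no manager hostname` and sending that command's stderr to `DEVNULL`, so the underlying SSH error never reaches the console. Exit status 255 is the code `ssh` itself returns when the connection or authentication fails. A minimal sketch for rerunning the same probe while keeping stderr could look like the following; the helper names `build_ssh_check_cmd` and `check_host` are hypothetical and are not part of DeepSpeed.

```python
import subprocess


def build_ssh_check_cmd(host):
    """Build the probe command shown in the CalledProcessError message."""
    return ["ssh", "-o", "PasswordAuthentication=no", host, "hostname"]


def check_host(host, run=subprocess.run, timeout=10):
    """Run the probe for one host and return (ok, stderr_text).

    Unlike the launcher call in the traceback, stderr is captured rather
    than discarded. `run` is injectable so the function can be exercised
    without a real SSH server.
    """
    try:
        result = run(
            build_ssh_check_cmd(host),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # ssh binary missing, or the probe hung (e.g. unreachable address)
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()


def report(hosts, run=subprocess.run):
    """Return one human-readable status line per host."""
    lines = []
    for host in hosts:
        ok, err = check_host(host, run=run)
        status = "reachable" if ok else f"NOT reachable: {err}"
        lines.append(f"{host}: {status}")
    return lines
```

Printing `report(["manager", "worker"])` from inside the launching container should surface whatever message `ssh` emits for each host. Note that the probe has to succeed for `manager` as well, which according to the /etc/hosts output is the launching container itself (10.0.1.13).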
root@903c1e9c351c:/home/user/code# cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.0.1.13 903c1e9c351c
10.0.1.13 manager
10.0.1.15 worker
root@903c1e9c351c:/home/user/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.0.1.13  netmask 255.255.255.0  broadcast 10.0.1.255
        ether 02:42:0a:00:01:0d  txqueuelen 0  (Ethernet)
        RX packets 512  bytes 78880 (78.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 471  bytes 79480 (79.4 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
manager slots=1
worker slots=1
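The `slots=` entries above, together with the `--no_ssh` command quoted earlier, suggest how the per-node commands for the workaround could be derived. The sketch below parses the hostfile text and builds one command per host. It rests on two assumptions that the thread does not confirm: that `--node_rank` follows hostfile order (only rank 0 on `manager` appears in the report), and that the same hostfile path is valid on every node. The function names `parse_hostfile` and `no_ssh_commands` are hypothetical.

```python
def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into an ordered list of (host, slots)."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"unexpected hostfile line: {raw!r}")
        host, slot_field = parts
        key, _, value = slot_field.partition("=")
        if key != "slots" or not value.isdigit():
            raise ValueError(f"unexpected slots field: {slot_field!r}")
        entries.append((host, int(value)))
    return entries


def no_ssh_commands(entries, master_addr, script, hostfile="hostfile"):
    """Map each host to the command to run manually on that node.

    Assumption: node ranks are assigned in hostfile order, starting at 0.
    """
    return {
        host: (
            f"deepspeed --hostfile={hostfile} --no_ssh "
            f"--node_rank={rank} --master_addr={master_addr} {script}"
        )
        for rank, (host, _slots) in enumerate(entries)
    }
```

With the two-line hostfile from this report and `master_addr="10.0.1.13"`, the entry for `manager` matches the command quoted in the bug description; the entry for `worker` differs only in `--node_rank=1`, which is the assumed part.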