Distributed node deployment: problem specifying GPUs at launch #2710

Open
syd1997 opened this issue Dec 26, 2024 · 4 comments
syd1997 commented Dec 26, 2024

cuda 12.2
python3.10
transformers 4.47.0

xinference version: 1.0.1

# start supervisor

conda activate XXX
export XINFERENCE_HOME=/data/xinference
nohup xinference-supervisor -H $IP_ADDR

# start worker
conda activate XXX
export XINFERENCE_HOME=/data/xinference
export XINFERENCE_ENDPOINT=$IP_ADDR
nohup xinference-worker -e "$IP_ADDR:$PORT" -H $IP_ADDR
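
I have three nodes with 4, 2, and 4 GPUs respectively, and I started them in that same order (4-GPU, 2-GPU, 4-GPU).
When I launch three different models, the first model is given 4 GPUs and starts normally. For the second model, I can select at most 2 GPUs. Only after the 2-GPU node is occupied can I select 4 GPUs for the third model, which then runs on the remaining 4-GPU node.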

XprobeBot added the gpu label Dec 26, 2024
XprobeBot added this to the v1.x milestone Dec 26, 2024
syd1997 commented Dec 26, 2024

All of the models were launched from the web UI.

qinxuye commented Dec 26, 2024

A more reliable approach is probably to specify the node to run on via worker_ip.
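
A minimal sketch of that suggestion, not a confirmed fix: it picks a worker with enough free GPUs and builds launch arguments that pin the model to it. The IP addresses, the inventory dict, and the helpers `pick_worker`, `build_launch_kwargs`, and `launch` are hypothetical. The `worker_ip` and `n_gpu` keyword names are assumed to match `launch_model` in the Xinference Python client, so check them against the installed version. Other arguments a real launch may need (for example the model engine, size, and format) are left out.

```python
from typing import Dict, Optional

# Hypothetical inventory mirroring the report: worker IP -> free GPU count.
# The addresses are placeholders, not values from the issue.
FREE_GPUS: Dict[str, int] = {
    "10.0.0.1": 4,
    "10.0.0.2": 2,
    "10.0.0.3": 4,
}


def pick_worker(free_gpus: Dict[str, int], n_gpu: int) -> Optional[str]:
    """Return the first worker with at least n_gpu free GPUs, or None."""
    for ip, free in free_gpus.items():
        if free >= n_gpu:
            return ip
    return None


def build_launch_kwargs(model_name: str, n_gpu: int,
                        free_gpus: Dict[str, int]) -> dict:
    """Build keyword arguments that pin a model to one worker."""
    worker_ip = pick_worker(free_gpus, n_gpu)
    if worker_ip is None:
        raise RuntimeError(f"no worker has {n_gpu} free GPUs")
    free_gpus[worker_ip] -= n_gpu  # local book-keeping for the next launch
    return {"model_name": model_name, "n_gpu": n_gpu, "worker_ip": worker_ip}


def launch(endpoint: str, **kwargs):
    """Assumed client usage; needs xinference installed and a running cluster."""
    from xinference.client import Client
    return Client(endpoint).launch_model(**kwargs)
```

With the inventory above, two consecutive 4-GPU requests would be pinned to the first and third workers, and the 2-GPU node is skipped instead of having to be filled first. The same idea applies when launching from the web UI or the command line, provided the installed version exposes a worker IP option there.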

syd1997 commented Dec 26, 2024

Has anyone else run into this before? Is there a way to improve it, and are there plans to improve it in the future? Thanks!


github-actions bot commented Jan 2, 2025

This issue is stale because it has been open for 7 days with no activity.

github-actions bot added the stale label Jan 2, 2025