Distributed node deployment: problem specifying GPUs at launch #2710

syd1997 · 2024-12-26T11:21:16Z

cuda 12.2
python3.10
transformers 4.47.0

xinference version: 1.0.1

# start supervisor

conda activate XXX
export XINFERENCE_HOME=/data/xinference
nohup xinference-supervisor -H $IP_ADDR

# start worker
conda activate XXX
export XINFERENCE_HOME=/data/xinference
export XINFERENCE_ENDPOINT=$IP_ADDR
nohup xinference-worker -e "$IP_ADDR:$PORT" -H $IP_ADDR
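
I have three nodes with 4, 2, and 4 GPUs respectively, and I started them in that same order (4 GPUs, 2 GPUs, 4 GPUs).
When I launch three different models, the first model is assigned 4 GPUs and starts normally. For the second model, I can select at most 2 GPUs. Only after the 2-GPU node is occupied can I select 4 GPUs for the third model, which then runs on the remaining 4-GPU node.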

syd1997 · 2024-12-26T11:21:58Z

All of the models were launched from the web UI.

qinxuye · 2024-12-26T11:24:29Z

A more reliable approach is probably to specify the node the model runs on via worker_ip.
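
For anyone launching from a script instead of the web UI, the suggestion above might look roughly like the following sketch. It assumes the Python client's `launch_model` accepts `worker_ip` and `n_gpu` keyword arguments; the helper names `build_launch_kwargs` and `launch_on_worker`, the model name, the engine, the IP addresses, and the port are all placeholders that are not from this thread, so adjust them to your own cluster.

```python
from typing import Any, Dict, Optional


def build_launch_kwargs(
    model_name: str,
    worker_ip: str,
    n_gpu: int,
    model_type: str = "LLM",
    extra: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments that pin a model to one worker node.

    Hypothetical helper: it only builds a dict and does not contact the cluster.
    """
    if n_gpu < 1:
        raise ValueError("n_gpu must be at least 1")
    kwargs: Dict[str, Any] = {
        "model_name": model_name,
        "model_type": model_type,
        "n_gpu": n_gpu,
        # Assumption: worker_ip is the address the worker registered with
        # (the value passed to `xinference-worker -H`).
        "worker_ip": worker_ip,
    }
    # Extra options (for example model_engine) are passed through unchanged.
    kwargs.update(extra or {})
    return kwargs


def launch_on_worker(endpoint: str, **launch_kwargs: Any) -> str:
    """Launch a model through the supervisor and return its model UID."""
    # Imported lazily so build_launch_kwargs works without xinference installed.
    from xinference.client import Client

    client = Client(endpoint)
    return client.launch_model(**launch_kwargs)


# Example usage (placeholder values, requires a running cluster):
#
# kwargs = build_launch_kwargs(
#     "qwen2.5-instruct",            # placeholder model name
#     worker_ip="10.0.0.3",          # placeholder: the second 4-GPU node
#     n_gpu=4,
#     extra={"model_engine": "transformers"},  # placeholder engine
# )
# uid = launch_on_worker("http://10.0.0.1:9997", **kwargs)
```

Pinning each model to a node this way avoids depending on the order in which the supervisor picks workers, which is the behavior described in the original report.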

syd1997 · 2024-12-26T11:33:00Z

Has anyone else run into this before? Is there a way to improve it? Will it be optimized in the future? Thanks!

github-actions · 2025-01-02T19:03:38Z

This issue is stale because it has been open for 7 days with no activity.

XprobeBot added the gpu label Dec 26, 2024

XprobeBot added this to the v1.x milestone Dec 26, 2024

github-actions bot added the stale label Jan 2, 2025
