
Error when starting the xinference platform, possibly an async timeout #2723

Open · 1 of 3 tasks
feifei05 opened this issue Dec 30, 2024 · 7 comments

@feifei05

System Info

Python 3.10.12
Linux
NPU (Ascend)

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

1.1.0

The command used to start Xinference

XINFERENCE_MODEL_SRC=modelscope nohup xinference-local --host 123.12.1.123 --port 39997 > xinference_start.log 2>&1 &

Reproduction

I added code in resource.py that computes the used memory of Ascend NPUs. The node_resource returned by gather_node_info prints fine, but worker.py keeps failing in report_status at `async with timeout(2)` with "Report status got error." Below are the node_resource output and the error.

node_resource {
    'cpu':   ResourceStatus(usage=0.008, total=192, memory_used=46002270208, memory_available=1115963953152, memory_total=1622527479808),
    'gpu-0': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-1': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-2': GPUStatus(mem_total=64424509440.0, mem_free=15461882265.599998, mem_used=48962627174.4),
    'gpu-3': GPUStatus(mem_total=64424509440.0, mem_free=26414048870.4, mem_used=38010460569.6),
    'gpu-4': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-5': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-6': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-7': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996)
}

The error:

Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1155, in report_status
    status = await asyncio.to_thread(gather_node_info)
  File "/root/.pyenv/versions/3.10.12/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1154, in report_status
    async with timeout(2):
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2024-12-30 18:06:14,038 xinference.core.worker 2544354 ERROR Report status got error.
(the same traceback repeats)
2024-12-30 18:06:21,042 xinference.core.worker 2544354 ERROR Report status got error.
(the same traceback repeats)

(screenshot: resource1)
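For context, the change described above might look something like the sketch below. This is hypothetical, not the actual patch from this issue; it assumes torch_npu exposes a mem_get_info analogous to torch.cuda.mem_get_info (availability varies by torch_npu version), and it reuses the GPUStatus shape visible in the node_resource dump above.

    # Hypothetical sketch of an NPU branch in resource.py; not the actual patch.
    # Assumes torch_npu patches torch with an `npu` namespace mirroring
    # torch.cuda, including mem_get_info -> (free_bytes, total_bytes).
    import torch
    import torch_npu  # noqa: F401

    def gather_npu_status() -> dict:
        infos = {}
        for idx in range(torch.npu.device_count()):
            free, total = torch.npu.mem_get_info(idx)  # assumed API, see note above
            infos[f"gpu-{idx}"] = GPUStatus(  # GPUStatus as printed in the dump above
                mem_total=float(total),
                mem_free=float(free),
                mem_used=float(total - free),
            )
        return infos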

Expected behavior

The platform should start up normally.

@XprobeBot XprobeBot added the gpu label Dec 30, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Dec 30, 2024
@codingl2k1
Contributor

It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?
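For reference, the code path being discussed, reconstructed from the traceback above (worker.py, report_status): if gather_node_info blocks past the timeout, the worker thread is cancelled and async_timeout raises asyncio.TimeoutError.

    async with timeout(2):
        status = await asyncio.to_thread(gather_node_info)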

@feifei05
Author

> It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?

I checked with top, and one process is indeed at 100% CPU. How should I handle it?

@codingl2k1
Contributor

codingl2k1 commented Jan 2, 2025

> It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?
>
> I checked with top, and one process is indeed at 100% CPU. How should I handle it?

It would help to inspect the failing worker process's call stack with py-spy; it may also be that the torch npu module is hanging. A resource-usage query normally shouldn't take this long to return. @qinxuye would turning off this NPU resource reporting have any side effects?
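For example, py-spy dump --pid 2544354 prints the current call stack of the worker process seen in the logs above (py-spy must first be installed in the same environment, e.g. with pip install py-spy).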

@qinxuye
Contributor

qinxuye commented Jan 2, 2025

Is this a modification of your own? The open-source version has no Ascend info gathering.

@feifei05
Author

feifei05 commented Jan 3, 2025

> Is this a modification of your own? The open-source version has no Ascend info gathering.

Yes, I added the NPU info myself, with a conditional check.

@feifei05
Author

feifei05 commented Jan 3, 2025

> It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?
>
> I checked with top, and one process is indeed at 100% CPU. How should I handle it?
>
> It would help to inspect the failing worker process's call stack with py-spy; it may also be that the torch npu module is hanging. A resource-usage query normally shouldn't take this long to return. @qinxuye would turning off this NPU resource reporting have any side effects?

We changed the timeout to 10 s:

    async with timeout(10):
        status = await asyncio.to_thread(gather_node_info)

Now the platform starts, but the first status report still errors; after that there are no further errors. What causes this?
Also, why are there so many multiprocessing.forkserver processes after startup? They weren't there before when the platform started normally.

We changed the start method in deploy/local.py to "spawn". Does that carry any risk or side effects?
Version info:
Python 3.10.12
torch 2.1.0
torch-npu 2.1.0.post8
torchvision 0.16.0

This is deploy/local.py:
def run_in_subprocess(
    address: str,
    metrics_exporter_host: Optional[str] = None,
    metrics_exporter_port: Optional[int] = None,
    logging_conf: Optional[Dict] = None,
) -> multiprocessing.Process:
    multiprocessing.set_start_method(method="spawn", force=True)
    p = multiprocessing.Process(
        target=run,
        args=(address, metrics_exporter_host, metrics_exporter_port, logging_conf),
    )
    p.start()
    return p
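A note on that design choice, for what it's worth: "spawn" launches each child from a fresh interpreter rather than fork-copying the parent, so children do not inherit the parent's locks, threads, or device handles (which fork-based methods can corrupt once an accelerator runtime is initialized); the trade-offs are slower child startup and the requirement that everything passed to the child be picklable.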

This is the error output:
Thread start time: 1735871408.2499998
node_info gather start time: 1735871408.2505193
device_count: 8
2025-01-03 10:30:18,258 xinference.core.worker 4135310 ERROR Report status got error.
Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1167, in report_status
    status = await asyncio.to_thread(gather_node_info)
  File "/root/.pyenv/versions/3.10.12/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1166, in report_status
    async with timeout(10):
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2025-01-03 10:30:51,578 xinference.core.supervisor 4135310 ERROR Worker timeout. address: 172.16.1.149:23192, check count remaining 4...
...
2025-01-03 10:31:36,614 xinference.core.supervisor 4135310 ERROR Worker timeout. address: 172.16.1.149:23192, check count remaining 4...

These are the processes:
root 4135309 0.0 0.0 225072 11276 ? S 10:29 0:00 /run/xinference/.venv/bin/python -c from multiprocessing.resource_tracker import main;main(13)
root 4135310 5.5 0.0 8603240124 1454612 ? Sl 10:29 0:39 /run/xinference/.venv/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=16) --multiprocessing-fork
root 4135809 0.2 0.0 1373344 90332 ? Sl 10:30 0:01 /run/xinference/.venv/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=50) --multiprocessing-fork
root 4135870 0.0 0.0 226564 12212 ? S 10:30 0:00 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135871 0.3 0.0 760356 371800 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135872 0.3 0.0 760356 370692 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135873 0.3 0.0 760356 371096 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135874 0.3 0.0 760364 370868 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135875 0.3 0.0 760356 371684 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135876 0.3 0.0 760356 370384 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135877 0.3 0.0 760356 370868 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135878 0.3 0.0 760356 371272 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135879 0.3 0.0 760360 372020 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135889 0.3 0.0 760276 373780 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135890 0.3 0.0 760276 372152 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135891 0.3 0.0 760268 373576 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135892 0.3 0.0 760280 373840 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135893 0.3 0.0 760268 372488 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135894 0.3 0.0 760268 372788 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135895 0.3 0.0 760272 372008 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135896 0.3 0.0 760348 372288 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})

@codingl2k1 @qinxuye do you know how to fix this first-report error?

@codingl2k1
Contributor

You can set the XINFERENCE_HEALTH_CHECK_TIMEOUT environment variable to change the default health-check timeout; the current default is 10 s. The first-report timeout is still puzzling, though. Ideally, inspect the worker process's stack with py-spy to see exactly where it is stuck. If that is inconvenient, you could also add some timing logs around the NPU part.
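A minimal sketch of the kind of timing log suggested here, assuming it is added around the per-device queries in the (hypothetical) NPU branch sketched earlier; the helper name is illustrative, not part of Xinference:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def timed(label, fn, *args, **kwargs):
        # Log how long a single resource query takes, to localize the stall.
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            logger.info("%s took %.3f s", label, time.perf_counter() - start)

    # Usage inside the gathering loop (mem_get_info is an assumed API, see above):
    # free, total = timed(f"npu-{idx} mem_get_info", torch.npu.mem_get_info, idx)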
