Error when starting the Xinference platform, possibly an async timeout #2723
Comments
It looks like the call timed out because no result came back within 2 s. Is any process pinned at 100% CPU?
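It would help to inspect the call stack of the failing worker process with py-spy. The torch npu module may also be hanging. A resource-usage query should not normally take this long to return. @qinxuye would turning off this NPU resource reporting have any side effects?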
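Where attaching py-spy to the worker is not practical, a rough in-process alternative is the standard-library faulthandler module, which can dump every thread's stack if a call has not returned within a deadline. This is a different technique from py-spy, shown only as a sketch: possibly_hanging_query is a hypothetical stand-in for the NPU query, and the 0.1 s deadline is arbitrary.

```python
import faulthandler
import tempfile
import time


def possibly_hanging_query():
    # Hypothetical stand-in for the NPU resource query; it just blocks briefly.
    time.sleep(0.3)


with tempfile.TemporaryFile() as dump_file:
    # Ask faulthandler to write all thread stacks to dump_file
    # if we are still running 0.1 s from now.
    faulthandler.dump_traceback_later(0.1, file=dump_file)
    try:
        possibly_hanging_query()
    finally:
        # Disarm the watchdog once the call has returned.
        faulthandler.cancel_dump_traceback_later()
    dump_file.seek(0)
    stack_dump = dump_file.read().decode("utf-8", "replace")

print(stack_dump)
```

The dump names the function that was executing when the deadline passed, which is the same information a py-spy stack dump would be used for here.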
Is this your own modification? The open-source version does not collect Ascend information.
I added the NPU information myself and guarded it with a check.
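We changed the timeout to 10 s: async with timeout(10): We also changed the start method in local.py under the deploy folder to "spawn". Is there any risk in that? Does it affect anything? This is deploy/local.py. This is the error output: During handling of the above exception, another exception occurred: Traceback (most recent call last): These are the processes. @codingl2k1 @qinxuye do you know how to fix the first reporting error?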
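You can set the XINFERENCE_HEALTH_CHECK_TIMEOUT environment variable to change the default health-check timeout, which is currently 10 s. The first timeout is still strange, though. Ideally, use py-spy to look at the worker process's stack and see exactly where it is stuck. If that is inconvenient, you could also add some logging around the NPU code to see how long it takes.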
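One way to add that timing log is a small decorator around the custom NPU query. This is only a sketch: query_npu_memory and the logger name are hypothetical placeholders, not part of Xinference, and the sleep stands in for the real Ascend call.

```python
import functools
import logging
import time

logger = logging.getLogger("npu_timing")  # hypothetical logger name


def log_duration(func):
    """Log how long each call to func takes, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            wrapper.last_elapsed = elapsed  # kept for quick inspection
            logger.warning("%s took %.3f s", func.__name__, elapsed)
    wrapper.last_elapsed = None
    return wrapper


@log_duration
def query_npu_memory():
    # Hypothetical placeholder for the custom Ascend memory query.
    time.sleep(0.05)
    return {"mem_used": 0}


result = query_npu_memory()
```

If the logged duration is regularly close to or above the reporting timeout, the NPU query itself is the likely cause. If it is short, the delay is somewhere else in the reporting path.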
System Info
python 3.10.12
linux
NPU
Running Xinference with Docker?
Version info
1.1.0
The command used to start Xinference
XINFERENCE_MODEL_SRC=modelscope nohup xinference-local --host 123.12.1.123 --port 39997 > xinference_start.log 2>&1 &
Reproduction
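I added code to resource.py that computes used NPU memory on Ascend cards. The node_resource returned by gather_node_info prints correctly, but worker.py keeps failing in report_status at async with timeout(2) with the error "Report status got error." The printed node_resource and the error follow.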
node_resource {'cpu': ResourceStatus(usage=0.008, total=192, memory_used=46002270208, memory_available=1115963953152, memory_total=1622527479808), 'gpu-0': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996), 'gpu-1': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996), 'gpu-2': GPUStatus(mem_total=64424509440.0, mem_free=15461882265.599998, mem_used=48962627174.4), 'gpu-3': GPUStatus(mem_total=64424509440.0, mem_free=26414048870.4, mem_used=38010460569.6), 'gpu-4': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996), 'gpu-5': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996), 'gpu-6': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996), 'gpu-7': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996)}
Error
Traceback (most recent call last):
File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1155, in report_status
status = await asyncio.to_thread(gather_node_info)
File "/root/.pyenv/versions/3.10.12/lib/python3.10/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1154, in report_status
async with timeout(2):
File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2024-12-30 18:06:14,038 xinference.core.worker 2544354 ERROR Report status got error.
Traceback (most recent call last):
File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1155, in report_status
status = await asyncio.to_thread(gather_node_info)
File "/root/.pyenv/versions/3.10.12/lib/python3.10/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1154, in report_status
async with timeout(2):
File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2024-12-30 18:06:21,042 xinference.core.worker 2544354 ERROR Report status got error.
Traceback (most recent call last):
File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1155, in report_status
status = await asyncio.to_thread(gather_node_info)
File "/root/.pyenv/versions/3.10.12/lib/python3.10/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1154, in report_status
async with timeout(2):
File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
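The shape of this traceback (a cancelled asyncio.to_thread call followed by a TimeoutError) can be reproduced in isolation. The sketch below is not the Xinference code: it uses the standard-library asyncio.wait_for instead of the async_timeout package seen in the traceback, and slow_gather_node_info is a hypothetical stand-in that simply blocks for 0.3 s.

```python
import asyncio
import time


def slow_gather_node_info():
    # Hypothetical stand-in for gather_node_info: a blocking call in a thread.
    time.sleep(0.3)
    return {"cpu": "ok"}


async def report_status(timeout_s):
    try:
        # When the deadline passes, the awaiting task is cancelled and
        # asyncio.TimeoutError is raised to the caller.
        return await asyncio.wait_for(
            asyncio.to_thread(slow_gather_node_info), timeout_s
        )
    except asyncio.TimeoutError:
        return "timeout"


short_deadline = asyncio.run(report_status(0.05))
long_deadline = asyncio.run(report_status(2.0))
print(short_deadline, long_deadline)
```

Note that cancelling the awaiting task does not stop the worker thread: the blocking function keeps running to completion in the background, so a query that hangs will also keep occupying an executor thread on every retry.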
Expected behavior
The platform should start normally.