
Error when starting the xinference platform, possibly an async timeout #2723

Open · 1 of 3 tasks
feifei05 opened this issue Dec 30, 2024 · 7 comments

@feifei05

System Info

Python 3.10.12
Linux
NPU (Ascend)

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

1.1.0

The command used to start Xinference

XINFERENCE_MODEL_SRC=modelscope nohup xinference-local --host 123.12.1.123 --port 39997 > xinference_start.log 2>&1 &

Reproduction

I added code in resource.py that computes the used memory of Ascend NPUs. The node_resource returned by gather_node_info prints fine, but worker.py keeps failing in report_status at `async with timeout(2)` with "Report status got error." Below are the node_resource output and the error.

node_resource {
    'cpu':   ResourceStatus(usage=0.008, total=192, memory_used=46002270208, memory_available=1115963953152, memory_total=1622527479808),
    'gpu-0': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-1': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-2': GPUStatus(mem_total=64424509440.0, mem_free=15461882265.599998, mem_used=48962627174.4),
    'gpu-3': GPUStatus(mem_total=64424509440.0, mem_free=26414048870.4, mem_used=38010460569.6),
    'gpu-4': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-5': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-6': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996),
    'gpu-7': GPUStatus(mem_total=64424509440.0, mem_free=60559038873.6, mem_used=3865470566.3999996)
}

The error:

Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1155, in report_status
    status = await asyncio.to_thread(gather_node_info)
  File "/root/.pyenv/versions/3.10.12/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1154, in report_status
    async with timeout(2):
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2024-12-30 18:06:14,038 xinference.core.worker 2544354 ERROR Report status got error.
(the same traceback repeats)
2024-12-30 18:06:21,042 xinference.core.worker 2544354 ERROR Report status got error.
(the same traceback repeats)

(screenshot: resource1)
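For context, the change described above might look something like the sketch below. This is hypothetical, not the actual patch from this issue; it assumes torch_npu exposes a mem_get_info analogous to torch.cuda.mem_get_info (availability varies by torch_npu version), and it reuses the GPUStatus shape visible in the node_resource dump above.

    # Hypothetical sketch of an NPU branch in resource.py; not the actual patch.
    # Assumes torch_npu patches torch with an `npu` namespace mirroring
    # torch.cuda, including mem_get_info -> (free_bytes, total_bytes).
    import torch
    import torch_npu  # noqa: F401

    def gather_npu_status() -> dict:
        infos = {}
        for idx in range(torch.npu.device_count()):
            free, total = torch.npu.mem_get_info(idx)  # assumed API, see note above
            infos[f"gpu-{idx}"] = GPUStatus(  # GPUStatus as printed in the dump above
                mem_total=float(total),
                mem_free=float(free),
                mem_used=float(total - free),
            )
        return infos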

Expected behavior

The platform should start up normally.

@XprobeBot XprobeBot added the gpu label Dec 30, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Dec 30, 2024
@codingl2k1
Contributor

It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?
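For reference, the code path being discussed, reconstructed from the traceback above (worker.py, report_status): if gather_node_info blocks past the timeout, the worker thread is cancelled and async_timeout raises asyncio.TimeoutError.

    async with timeout(2):
        status = await asyncio.to_thread(gather_node_info)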

@feifei05
Author

> It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?

I checked with top, and one process is indeed at 100% CPU. How should I handle it?

@codingl2k1
Contributor

codingl2k1 commented Jan 2, 2025

> It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?
>
> I checked with top, and one process is indeed at 100% CPU. How should I handle it?

It would help to inspect the failing worker process's call stack with py-spy; it may also be that the torch npu module is hanging. A resource-usage query normally shouldn't take this long to return. @qinxuye would turning off this NPU resource reporting have any side effects?
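For example, py-spy dump --pid 2544354 prints the current call stack of the worker process seen in the logs above (py-spy must first be installed in the same environment, e.g. with pip install py-spy).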

@qinxuye
Contributor

qinxuye commented Jan 2, 2025

Is this a modification of your own? The open-source version has no Ascend info gathering.

@feifei05
Author

feifei05 commented Jan 3, 2025

> Is this a modification of your own? The open-source version has no Ascend info gathering.

Yes, I added the NPU info myself, with a conditional check.

@feifei05
Author

feifei05 commented Jan 3, 2025

> It looks like gather_node_info didn't return a result within 2 s and timed out. Is any process sitting at 100% CPU?
>
> I checked with top, and one process is indeed at 100% CPU. How should I handle it?
>
> It would help to inspect the failing worker process's call stack with py-spy; it may also be that the torch npu module is hanging. A resource-usage query normally shouldn't take this long to return. @qinxuye would turning off this NPU resource reporting have any side effects?

We changed the timeout to 10 s:

    async with timeout(10):
        status = await asyncio.to_thread(gather_node_info)

Now the platform starts, but the first status report still errors; after that there are no further errors. What causes this?
Also, why are there so many multiprocessing.forkserver processes after startup? They weren't there before when the platform started normally.

We changed the start method in deploy/local.py to "spawn". Does that carry any risk or side effects?
Version info:
Python 3.10.12
torch 2.1.0
torch-npu 2.1.0.post8
torchvision 0.16.0

This is deploy/local.py:
def run_in_subprocess(
    address: str,
    metrics_exporter_host: Optional[str] = None,
    metrics_exporter_port: Optional[int] = None,
    logging_conf: Optional[Dict] = None,
) -> multiprocessing.Process:
    multiprocessing.set_start_method(method="spawn", force=True)
    p = multiprocessing.Process(
        target=run,
        args=(address, metrics_exporter_host, metrics_exporter_port, logging_conf),
    )
    p.start()
    return p
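A note on that design choice, for what it's worth: "spawn" launches each child from a fresh interpreter rather than fork-copying the parent, so children do not inherit the parent's locks, threads, or device handles (which fork-based methods can corrupt once an accelerator runtime is initialized); the trade-offs are slower child startup and the requirement that everything passed to the child be picklable.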

This is the error output:
Thread start time: 1735871408.2499998
node_info gather start time: 1735871408.2505193
device_count: 8
2025-01-03 10:30:18,258 xinference.core.worker 4135310 ERROR Report status got error.
Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1167, in report_status
    status = await asyncio.to_thread(gather_node_info)
  File "/root/.pyenv/versions/3.10.12/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/run/xinference/.venv/lib/python3.10/site-packages/xinference/core/worker.py", line 1166, in report_status
    async with timeout(10):
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/run/xinference/.venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2025-01-03 10:30:51,578 xinference.core.supervisor 4135310 ERROR Worker timeout. address: 172.16.1.149:23192, check count remaining 4...
...
2025-01-03 10:31:36,614 xinference.core.supervisor 4135310 ERROR Worker timeout. address: 172.16.1.149:23192, check count remaining 4...

These are the processes:
root 4135309 0.0 0.0 225072 11276 ? S 10:29 0:00 /run/xinference/.venv/bin/python -c from multiprocessing.resource_tracker import main;main(13)
root 4135310 5.5 0.0 8603240124 1454612 ? Sl 10:29 0:39 /run/xinference/.venv/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=16) --multiprocessing-fork
root 4135809 0.2 0.0 1373344 90332 ? Sl 10:30 0:01 /run/xinference/.venv/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=14, pipe_handle=50) --multiprocessing-fork
root 4135870 0.0 0.0 226564 12212 ? S 10:30 0:00 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135871 0.3 0.0 760356 371800 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135872 0.3 0.0 760356 370692 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135873 0.3 0.0 760356 371096 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135874 0.3 0.0 760364 370868 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135875 0.3 0.0 760356 371684 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135876 0.3 0.0 760356 370384 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135877 0.3 0.0 760356 370868 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135878 0.3 0.0 760356 371272 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135879 0.3 0.0 760360 372020 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135889 0.3 0.0 760276 373780 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135890 0.3 0.0 760276 372152 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135891 0.3 0.0 760268 373576 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135892 0.3 0.0 760280 373840 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135893 0.3 0.0 760268 372488 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135894 0.3 0.0 760268 372788 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135895 0.3 0.0 760272 372008 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})
root 4135896 0.3 0.0 760348 372288 ? Sl 10:30 0:02 /run/xinference/.venv/bin/python -c from multiprocessing.forkserver import main; main(47, 49, [], **{})

@codingl2k1 @qinxuye do you know how to fix this first-report error?

@codingl2k1
Contributor

You can set the XINFERENCE_HEALTH_CHECK_TIMEOUT environment variable to change the default health-check timeout; the current default is 10 s. The first-report timeout is still puzzling, though. Ideally, inspect the worker process's stack with py-spy to see exactly where it is stuck. If that is inconvenient, you could also add some timing logs around the NPU part.
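A minimal sketch of the kind of timing log suggested here, assuming it is added around the per-device queries in the (hypothetical) NPU branch sketched earlier; the helper name is illustrative, not part of Xinference:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def timed(label, fn, *args, **kwargs):
        # Log how long a single resource query takes, to localize the stall.
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            logger.info("%s took %.3f s", label, time.perf_counter() - start)

    # Usage inside the gathering loop (mem_get_info is an assumed API, see above):
    # free, total = timed(f"npu-{idx} mem_get_info", torch.npu.mem_get_info, idx)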
