xinference launch --model-name qwen2.5-instruct \
  --model-type LLM \
  --model-uid Qwen1_5B \
  --model_path /models/Qwen/Qwen2___5-1___5B-Instruct \
  --model-engine 'vllm' \
  --model-format 'pytorch' \
  --quantization None \
  --n-gpu 1 \
  --gpu-idx "0" \
  --tensor_parallel_size 1 \
  --gpu_memory_utilization 0.30 \
  --max_model_len 4096
Launch model name: qwen2.5-instruct with kwargs: {'model_path': '/models/Qwen/Qwen2___5-1___5B-Instruct', 'tensor_parallel_size': 1, 'gpu_memory_utilization': 0.3, 'max_model_len': 4096}
Traceback (most recent call last):
  File "/usr/local/bin/xinference", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/cmdline.py", line 908, in model_launch
    model_uid = client.launch_model(
  File "/usr/local/lib/python3.10/dist-packages/xinference/client/restful/restful_client.py", line 999, in launch_model
    raise RuntimeError(
RuntimeError: Failed to launch model, detail: [address=0.0.0.0:26194, pid=237] User specified GPU index 0 has been occupied with a vLLM model: Qwen0_5B-0, therefore cannot allocate GPU memory for a new model.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.
System Info
cuda 12.2
Running Xinference with Docker?
Version info
1.1.0
The command used to start Xinference
Reproduction
Expected behavior
As long as GPU memory allows, a single GPU should be able to load multiple models.
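To make the expectation concrete, here is a minimal sketch of the kind of check I have in mind. It treats each model's `gpu_memory_utilization` as the fraction of the GPU it reserves and accepts a new model while the sum stays within the card. The function name `fits_on_gpu` and the 0.30 value for the already-running `Qwen0_5B-0` are assumptions for illustration; this is not Xinference's actual allocation logic.

```python
from typing import Dict


def fits_on_gpu(
    running: Dict[str, float],
    new_fraction: float,
    limit: float = 1.0,
) -> bool:
    """Return True if a new model reserving `new_fraction` of the GPU
    still fits next to the models listed in `running`.

    `running` maps model UID -> reserved fraction of GPU memory
    (hypothetical bookkeeping, not an Xinference data structure).
    """
    if not 0.0 < new_fraction <= limit:
        raise ValueError("new_fraction must be in (0, limit]")
    total = sum(running.values()) + new_fraction
    # Small tolerance so sums such as 0.7 + 0.3 are not rejected
    # because of floating-point rounding.
    return total <= limit + 1e-9


# Assumed value: the report does not say how much Qwen0_5B-0 reserves.
running_models = {"Qwen0_5B-0": 0.30}
can_launch = fits_on_gpu(running_models, 0.30)
```

With these assumed numbers the two models together would reserve 0.60 of the card, so I would expect the second launch to succeed instead of being rejected only because a vLLM model already occupies GPU 0.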