Internal error for batch inference: probability tensor contains either inf, nan or element < 0. #2728

Open
lukuanwang-delta opened this issue Dec 31, 2024 · 2 comments

@lukuanwang-delta

System Info

NVIDIA-SMI 550.127.08  Driver Version: 550.127.08  CUDA Version: 12.4
Using xinference v1.1.1.

The error output is as follows:
2024-12-31 01:17:30,923 xinference.core.model 471 INFO     ModelActor(Llama-3.1-Nemotron-70B-Instruct-HF-0) loaded

2024-12-31 01:17:30,925 xinference.core.worker 109 INFO [request f449ca82-c757-11ef-b2ea-0242ac1a0002] Leave launch_builtin_model, elapsed time: 46 s
2024-12-31 01:17:44,027 transformers.generation.configuration_utils 471 INFO loading configuration file /root/llm/model/Llama-3.1-Nemotron-70B-Instruct-HF/generation_config.json
loading configuration file /root/llm/model/Llama-3.1-Nemotron-70B-Instruct-HF/generation_config.json
2024-12-31 01:17:44,027 transformers.generation.configuration_utils 471 INFO Generate config GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
]
}

Generate config GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
]
}

2024-12-31 01:17:57,391 transformers.models.llama.modeling_llama 471 WARNING We detected that you are passing past_key_values as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate Cache class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
We detected that you are passing past_key_values as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate Cache class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
2024-12-31 01:18:09,955 xinference.model.llm.transformers.utils 471 ERROR Internal error for batch inference: probability tensor contains either inf, nan or element < 0.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 491, in batch_inference_one_step
_batch_inference_one_step_internal(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 335, in _batch_inference_one_step_internal
token = _get_token_from_logits(
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 111, in _get_token_from_logits
indices = torch.multinomial(probs, num_samples=2)
RuntimeError: probability tensor contains either inf, nan or element < 0
Destroy generator 17ad5836c75811efa6630242ac1a0002 due to an error encountered.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 419, in xoscar_next
r = await asyncio.create_task(_async_wrapper(gen))
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 409, in _async_wrapper
return await _gen.anext() # noqa: F821
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 485, in _to_async_gen
async for v in gen:
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 681, in _queue_consumer
raise RuntimeError(res[len(XINFERENCE_STREAMING_ERROR_FLAG) :])
RuntimeError: probability tensor contains either inf, nan or element < 0
2024-12-31 01:18:10,047 xinference.api.restful_api 1 ERROR Chat completion stream got an error: [address=0.0.0.0:44787, pid=471] probability tensor contains either inf, nan or element < 0
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 2072, in stream_results
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 340, in anext
return await self._actor_ref.xoscar_next(self._uid)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 431, in xoscar_next
raise e
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 419, in xoscar_next
r = await asyncio.create_task(_async_wrapper(gen))
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 409, in _async_wrapper
return await _gen.anext() # noqa: F821
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 485, in _to_async_gen
async for v in gen:
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 681, in _queue_consumer
raise RuntimeError(res[len(XINFERENCE_STREAMING_ERROR_FLAG) :])
RuntimeError: [address=0.0.0.0:44787, pid=471] probability tensor contains either inf, nan or element < 0
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 527, in process_events
response = await route_utils.call_process_api(
File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 261, in call_process_api
output = await app.get_blocks().process_api(
File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1786, in process_api
result = await self.call_function(
File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1350, in call_function
prediction = await utils.async_iteration(iterator)
File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 709, in asyncgen_wrapper
response = await iterator.__anext__()
File "/usr/local/lib/python3.10/dist-packages/gradio/chat_interface.py", line 545, in _stream_fn
first_response = await async_iteration(generator)
File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 576, in __anext__
return await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 559, in run_sync_iterator_async
return next(iterator)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/chat_interface.py", line 126, in generate_wrapper
for chunk in model.chat(
File "/usr/local/lib/python3.10/dist-packages/xinference/client/common.py", line 51, in streaming_response_iterator
raise Exception(str(error))
Exception: [address=0.0.0.0:44787, pid=471] probability tensor contains either inf, nan or element < 0
2024-12-31 01:19:20,387 xinference.model.llm.transformers.utils 471 ERROR Internal error for batch inference: probability tensor contains either inf, nan or element < 0.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 491, in batch_inference_one_step
_batch_inference_one_step_internal(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 335, in _batch_inference_one_step_internal
token = _get_token_from_logits(
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 111, in _get_token_from_logits
indices = torch.multinomial(probs, num_samples=2)
RuntimeError: probability tensor contains either inf, nan or element < 0
2024-12-31 01:19:20,420 xinference.core.model 471 ERROR [request 41af0c06-c758-11ef-a663-0242ac1a0002] Leave chat, error: probability tensor contains either inf, nan or element < 0, elapsed time: 25 s
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 94, in wrapped
ret = await func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 735, in chat
return await self.handle_batching_request(
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 718, in handle_batching_request
result = await fut
ValueError: probability tensor contains either inf, nan or element < 0
2024-12-31 01:19:20,424 xinference.api.restful_api 1 ERROR [address=0.0.0.0:44787, pid=471] probability tensor contains either inf, nan or element < 0
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 2098, in create_chat_completion
data = await model.chat(
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 102, in wrapped_func
ret = await fn(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 462, in _wrapper
r = await func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 94, in wrapped
ret = await func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 735, in chat
return await self.handle_batching_request(
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 718, in handle_batching_request
result = await fut
ValueError: [address=0.0.0.0:44787, pid=471] probability tensor contains either inf, nan or element < 0
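
For context, the step that fails is the next-token sampling in xinference's transformers backend (the torch.multinomial call in _get_token_from_logits). Below is a minimal, stand-alone sketch, not the actual xinference code path, showing how exactly this message appears once the model's logits contain NaN/Inf (for example after a numerical overflow):

import torch

# Hypothetical logits containing a NaN; softmax propagates it into the
# probability tensor, so every entry becomes NaN.
logits = torch.tensor([1.0, 2.0, float("nan")])
probs = torch.softmax(logits, dim=-1)
print(probs)  # tensor([nan, nan, nan])

try:
    # Same kind of call as in _get_token_from_logits.
    torch.multinomial(probs, num_samples=2)
except RuntimeError as err:
    print(err)  # probability tensor contains either `inf`, `nan` or element < 0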

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

docker image
v1.1.1

The command used to start Xinference

Started via a Docker container (docker compose):

services:
  xinference:
    image: xprobe/xinference:v1.1.1
    container_name: xinference
    ports:
      - "9997:9997"
    volumes:
      - /opt/xinference/.xinference:/root/.xinference/
      - /opt/xinference/.cache:/root/.cache/
      - /opt/llm/model/:/root/llm/model/
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint: xinference-local
    command: ["-H", "0.0.0.0"]
    restart: always

Reproduction

  1. Register a custom model.

  2. The model registration JSON used:

    {
    "version": 1,
    "context_length": 2048,
    "model_name": "Llama-3.1-Nemotron-70B-Instruct-HF",
    "model_lang": [
      "en",
      "zh"
    ],
    "model_ability": [
      "chat"
    ],
    "model_description": "Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries.",
    "model_family": "qwen2.5-instruct",
    "model_specs": [
      {
        "model_format": "pytorch",
        "model_size_in_billions": 70,
        "quantizations": [
          "none"
        ],
        "model_id": null,
        "model_hub": "huggingface",
        "model_uri": "/root/llm/model/Llama-3.1-Nemotron-70B-Instruct-HF",
        "model_revision": null
      }
    ],
    "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }}\n {%- for tool in tools %}\n {{- "\n" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- "\n\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": , \"arguments\": }\n</tool_call><|im_end|>\n" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}\n {%- else %}\n {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}\n {%- elif message.role == "assistant" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\n<tool_call>\n{"name": "' }}\n {{- tool_call.name }}\n {{- '", "arguments": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\n' }}\n {%- elif message.role == "tool" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\n<tool_response>\n' }}\n {{- message.content }}\n {{- '\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n {{- '<|im_end|>\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\n' }}\n{%- endif %}\n",
    "stop_token_ids": [
    151643,
    151644,
    151645
    ],
    "stop": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>"
    ],
    "is_builtin": false
    }

  3. Launch the model.

  4. Chat with the model; the error above is raised. (A rough client-side sketch of these steps follows.)
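
For reference, a rough programmatic equivalent of the steps above using the xinference Python client (the report itself goes through the web UI; exact parameters and the chat signature may differ between versions, and nemotron.json is assumed to hold the registration JSON shown above):

from xinference.client import Client

client = Client("http://localhost:9997")

# Steps 1-2: register the custom model definition
# (assumed to be saved locally as nemotron.json).
with open("nemotron.json") as f:
    client.register_model(model_type="LLM", model=f.read(), persist=False)

# Step 3: launch it on the transformers backend without quantization.
model_uid = client.launch_model(
    model_name="Llama-3.1-Nemotron-70B-Instruct-HF",
    model_engine="transformers",
    model_format="pytorch",
    model_size_in_billions=70,
    quantization="none",
)

# Step 4: chatting with the launched model then triggers the error above.
model = client.get_model(model_uid)
print(model.chat(messages=[{"role": "user", "content": "hello"}]))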

Expected behavior

Chat should work normally.

@XprobeBot XprobeBot added the gpu label Dec 31, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Dec 31, 2024
@qinxuye
Contributor

qinxuye commented Jan 3, 2025

Did you enable quantization when loading the model?

@lukuanwang-delta
Author

> Did you enable quantization when loading the model?

quantization is None.
