[Bug]: Extremely slow inference speed when deploying with vLLM on 16 H100 GPUs according to instructions on DeepSeekV3 #11705
Comments
Has this problem been solved? I have encountered it as well, on 2 × 8 H20 GPUs.
I encountered the same issue while running inference on 16 NVIDIA H100 80GB HBM3 GPUs.
Same issue. I used 16 H100 GPUs with TP=16, deployed with Ray on Kubernetes, and enabled the InfiniBand network. A simple curl request with 10 input tokens and 242 output tokens took 44 seconds. Can anyone help me figure out why? NVIDIA-SMI 550.90.07
curl example:
response time:
response:
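(The commenter's original curl example and response were not captured above. A minimal request of the kind described, against vLLM's OpenAI-compatible `/v1/chat/completions` endpoint, would look roughly like the sketch below; the host, port, and model name are placeholder assumptions, not values from this issue.)

```bash
# Hypothetical request; host, port, and model name are assumptions.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 256
      }'
```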
What GPU resources does DeepSeek V3 need to run smoothly?
Met a similar issue when running on 8×H200, with the speed shown below.
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
I'm deploying the model using the following command:
I'm using the official Ray example, and NCCL is enabled. After launching the model with the above command, the inference speed is extremely slow.
The inference speed is almost 5 times slower than an unquantized Qwen-72B model.
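The exact launch command was not captured in this report. A representative multi-node setup of the kind the vLLM distributed-serving documentation describes (two 8-GPU nodes joined into one Ray cluster, then a single OpenAI-compatible server with TP=16) is sketched below; node addresses, ports, and the context length are placeholder assumptions, not values from this issue.

```bash
# Sketch of a 2-node x 8-GPU deployment; addresses, ports, and
# --max-model-len are placeholders, not values from this issue.

# On the head node:
ray start --head --port=6379

# On the second node (replace <head_node_ip> with the head node's address):
ray start --address=<head_node_ip>:6379

# On the head node, launch the OpenAI-compatible server across all 16 GPUs:
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 16 \
  --trust-remote-code \
  --max-model-len 8192
```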
INFO: 10.39.129.93:36766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 01-02 16:16:48 async_llm_engine.py:211] Added request chatcmpl-bc1d5239d4c743aabedf1249038b99da.
INFO 01-02 16:16:56 metrics.py:467] Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 01-02 16:17:02 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 01-02 16:17:07 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 01-02 16:17:10 async_llm_engine.py:179] Finished request chatcmpl-bc1d5239d4c743aabedf1249038b99da.