
[Bug]: Extremely slow inference speed when deploying with vLLM on 16 H100 GPUs according to instructions on DeepSeekV3 #11705

Open
yonghenglh6 opened this issue Jan 3, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@yonghenglh6

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.134-008.7.kangaroo.al8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L20Z
GPU 1: NVIDIA L20Z
GPU 2: NVIDIA L20Z
GPU 3: NVIDIA L20Z
GPU 4: NVIDIA L20Z
GPU 5: NVIDIA L20Z
GPU 6: NVIDIA L20Z
GPU 7: NVIDIA L20Z

Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          130
On-line CPU(s) list:             0-129
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Processor
CPU family:                      6
Model:                           143
Thread(s) per core:              1
Core(s) per socket:              130
Socket(s):                       1
Stepping:                        8
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd avx512vbmi umip pku waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       6.1 MiB (130 instances)
L1i cache:                       4.1 MiB (130 instances)
L2 cache:                        260 MiB (130 instances)
L3 cache:                        105 MiB (1 instance)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-64
NUMA node1 CPU(s):               65-129
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.45.2
[pip3] triton==3.1.0
[conda] numpy                     1.26.3                   pypi_0    pypi
[conda] torch                     2.5.0+cpu                pypi_0    pypi
[conda] torchmetrics              1.0.3                    pypi_0    pypi
[conda] torchrec                  1.0.0+cpu                pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.6.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-129	0-1		N/A
NIC0	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB	PHB	PHB	PHB	PHB	PHB	PHB				
NIC1	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB	PHB	PHB	PHB	PHB	PHB				
NIC2	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB	PHB	PHB	PHB	PHB				
NIC3	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB	PHB	PHB	PHB				
NIC4	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB	PHB	PHB				
NIC5	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB	PHB				
NIC6	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB				
NIC7	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

NVIDIA_VISIBLE_DEVICES=all
NCCL_IB_TC=16
NCCL_MIN_NCHANNELS=4
NCCL_NET_PLUGIN=none
NCCL_SOCKET_IFNAME=eth
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NCCL_IB_HCA=mlx5
NCCL_IB_GID_INDEX=3
NCCL_IB_QPS_PER_CONNECTION=8
NCCL_IB_TIMEOUT=22
NCCL_IB_SL=5
LD_LIBRARY_PATH=/cpfs/user/chenge/miniconda3/envs/vllm/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_MODULE_LOADING=LAZY



Model Input Dumps

No response

🐛 Describe the bug

I'm deploying the model using the following command:

vllm serve local_deepseekv3_path --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 2 --max-model-len 16384 --served-model-name deepseek-v3 deepseek

I'm using the official Ray example, and NCCL is enabled. After launching the model with the above command, inference is extremely slow: almost 5 times slower than an unquantized Qwen-72B model.

INFO: 10.39.129.93:36766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 01-02 16:16:48 async_llm_engine.py:211] Added request chatcmpl-bc1d5239d4c743aabedf1249038b99da.
INFO 01-02 16:16:56 metrics.py:467] Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 01-02 16:17:02 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 01-02 16:17:07 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 01-02 16:17:10 async_llm_engine.py:179] Finished request chatcmpl-bc1d5239d4c743aabedf1249038b99da.
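
For anyone reproducing this measurement, here is a minimal sketch of timing a single request against vLLM's OpenAI-compatible endpoint and deriving tokens/s from the reported usage. It assumes the server launched above is listening on localhost:8000 (vLLM's default port) and that jq and bc are installed; the prompt and max_tokens are placeholders.

```bash
# Sketch: time one non-streaming chat completion and compute generation tokens/s.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "deepseek-v3",
        "messages": [{"role": "user", "content": "Explain KV cache in one paragraph."}],
        "max_tokens": 256,
        "stream": false
      }')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
ELAPSED=$(echo "$END - $START" | bc)
echo "completion_tokens=$TOKENS elapsed=${ELAPSED}s"
echo "generation throughput: $(echo "scale=2; $TOKENS / $ELAPSED" | bc) tokens/s"
```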

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@yonghenglh6 yonghenglh6 added the bug Something isn't working label Jan 3, 2025
@yonghenglh6 yonghenglh6 changed the title [Bug]: Extremely slow inference speed when deploying with vLLM on 16 H100 GPUs according to instructions [Bug]: Extremely slow inference speed when deploying with vLLM on 16 H100 GPUs according to instructions on DeepSeekV3 Jan 3, 2025
@LaoZhang-best

Has this problem been solved? I have encountered it as well, on 2 nodes x 8 H20 GPUs.

@sander-1105

I encountered the same issue while running inference on 16 NVIDIA H100 80GB HBM3 GPUs.

@fan-niu

fan-niu commented Jan 9, 2025

Same issue. I used 16 H100 GPUs with TP=16, deployed with Ray in Kubernetes, and enabled the IB network. A simple curl request with 10 input tokens and 242 output tokens took 44 seconds. Can anyone help me figure out why?

NVIDIA-SMI 550.90.07
Driver Version: 550.90.07
CUDA Version: 12.4

curl example:
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "deepseek_v3",
        "messages": [
          {
            "role": "user",
            "content": "Describe this image in detail please."
          }
        ],
        "stream": false
      }'

response time:
real    0m44.194s
user    0m0.002s
sys     0m0.006s

response:
{"id":"chatcmpl-c0da88a82e784680a97742a9820d4c3c","object":"chat.completion","created":1736426977,"model":"deepseek_v3","choices":[{"index":0,"message":{"role":"assistant","content":"Since I cannot see or process images, I’ll create a detailed description for you based on your request. Here’s a vivid description of an imagined scene:\n\n*The image depicts a serene autumn landscape. A winding river sparkles under the golden sunlight, reflecting the colors of the surrounding forest. The trees are adorned with vibrant hues—rich reds, oranges, and yellows, signaling the peak of the autumn season. A few leaves are gently falling, dancing in the breeze before landing softly on the grassy banks of the river. In the distance, a small wooden bridge arches over the water, connecting the two sides of the forest. The sky above is a blend of soft blues and wispy clouds, hinting at a crisp, cool day. A lone hiker, dressed in warm attire, walks along a narrow path near the river, carrying a backpack and a sense of quiet determination. The overall atmosphere is one of tranquility and natural beauty, inviting the viewer to step into the scene and breathe in the crisp autumn air.* \n\nIf you have a specific image or details you’d like me to focus on, please provide some context, and I’ll adjust the description accordingly!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":252,"completion_tokens":242,"prompt_tokens_details":null},"prompt_logprobs":null}

@xueshuai0922

What GPU resources does DeepSeek-V3 need to run smoothly?

@binxuan

binxuan commented Jan 15, 2025

Met a similar issue when running on 8x H200, with the speed below:

27/10411 [1:02:17<259:27:32, 89.95s/it, est. speed input: 2.51 toks/s, output: 56.67 toks/s]
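
When multi-GPU serving is unexpectedly slow, one first sanity check, whether on a single node (8x H200) or across two nodes with Ray, is to confirm which interconnect the GPUs and NCCL are actually using. A sketch, assuming NCCL_DEBUG=INFO is set as in the environment dump above and that the server log was redirected to a hypothetical vllm_server.log:

```bash
# NV# entries indicate bonded NVLinks between GPU pairs; PHB means PCIe-only.
nvidia-smi topo -m
# With NCCL_DEBUG=INFO, NCCL logs the transport selected for each channel,
# e.g. "via P2P/..." for intra-node links, or "via NET/IB" vs. "via NET/Socket"
# across nodes (Socket instead of IB usually explains a large slowdown).
grep -iE 'via (P2P|NET)' vllm_server.log
```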
