Some questions about vllm attention and profile_run() #11227
I just ran Llama 3.3-70B on our own backend, and I found that the `profile_run()` function takes a very long time. After further investigation, it turned out that the calls to the attention mechanism account for most of the time in this stage.

In our current implementation, we invoke SDPA attention multiple times. The `query` input to SDPA has the shape `[num_head, num_token, head_dim]`, while the original data passed in by the vLLM framework has the shape `[num_token, num_head, head_dim]`. Therefore, we have to perform a `move_dim()` operation before invoking the SDPA operator and another `move_dim()` afterwards to restore the original shape. These operations cause a significant performance loss.

So I would like to ask: what are the main optimization approaches used in industry today? Should we merge the operations above into a single fused operator, or perhaps pass the data in with the shape `[num_head, num_token, head_dim]` when running `profile_run()`?
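For illustration, here is a minimal PyTorch sketch of the layout round-trip described above, with `torch.nn.functional.scaled_dot_product_attention` standing in for our backend's SDPA operator (the tensor sizes and the `.contiguous()` calls are illustrative only; the real kernel and its stride requirements may differ):

```python
# Minimal sketch of the move_dim() round-trip around an SDPA-style kernel.
# torch.nn.functional.scaled_dot_product_attention is a stand-in for the
# backend's own attention operator; sizes are illustrative.
import torch
import torch.nn.functional as F

num_token, num_head, head_dim = 1024, 8, 128

# Layout handed over by vLLM: [num_token, num_head, head_dim]
q = torch.randn(num_token, num_head, head_dim)
k = torch.randn(num_token, num_head, head_dim)
v = torch.randn(num_token, num_head, head_dim)

# move_dim before the kernel: [num_head, num_token, head_dim].
# movedim itself only rewrites strides; the .contiguous() models a kernel
# that needs densely packed memory and therefore forces a real copy,
# which is where the extra cost shows up.
q_t = torch.movedim(q, 1, 0).contiguous()
k_t = torch.movedim(k, 1, 0).contiguous()
v_t = torch.movedim(v, 1, 0).contiguous()

# num_head acts as the batch dimension for SDPA here.
out_t = F.scaled_dot_product_attention(q_t, k_t, v_t)  # [num_head, num_token, head_dim]

# move_dim back so the rest of the model code sees vLLM's original layout.
out = torch.movedim(out_t, 0, 1).contiguous()           # [num_token, num_head, head_dim]
```

If the kernel can tolerate non-contiguous strides, the `movedim` calls are essentially free views and only the copies forced by `.contiguous()` cost anything, which is why fusing the layout change into the operator (or keeping the kernel's preferred layout end to end) is attractive.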
Replies: 1 comment

- Regarding the excessively long running time of the `profile_run()` stage, my solution is to enable `prefill_chunk` by default. This option is enabled when the backend is CUDA and the model is run with a huge `model_seq_len`. The default `model_seq_len` of Llama 3.3-70B is around 130,000, which is far too large; in practice, the time spent on the device side is minimal and can be ignored.
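For reference, a rough sketch of what this looks like from the vLLM user side, assuming a recent release where `enable_chunked_prefill`, `max_num_batched_tokens`, and `max_model_len` are exposed as engine arguments (the model name and numbers below are illustrative only):

```python
# Rough user-side sketch: enable chunked prefill and cap the context length
# so profile_run() builds a much smaller dummy batch than the ~130k-token
# default. Values are illustrative, not tuned recommendations.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_chunked_prefill=True,   # split long prefills into smaller chunks
    max_num_batched_tokens=8192,   # upper bound on tokens per scheduler step
    max_model_len=8192,            # cap the ~130k default context length
    tensor_parallel_size=8,
)
```

With the context length capped (or prefill chunked), the dummy sequences built during profiling are far shorter, so the attention calls inside `profile_run()` no longer dominate startup time.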