Some questions about vllm attention and profile_run() #11227
I just ran Llama 3.3-70B on our own backend, and I found that the `profile_run()` function takes a very long time. After further investigation, it turned out that the calls to the attention mechanism account for most of the time in this stage.

In our current implementation, we invoke SDPA attention multiple times. The `query` input to SDPA has the shape `[num_head, num_token, head_dim]`, while the original data passed in by the vLLM framework has the shape `[num_token, num_head, head_dim]`. Therefore, we have to perform a `move_dim()` operation before invoking the SDPA operator and another `move_dim()` afterwards to restore the original shape. These operations cause a significant performance loss.

So I would like to ask: what are the main optimization approaches used in industry today? Should we merge the operations above into a single fused operator, or perhaps pass the data in with the shape `[num_head, num_token, head_dim]` when running `profile_run()`?
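For illustration, here is a minimal PyTorch sketch of the layout round-trip described above, with `torch.nn.functional.scaled_dot_product_attention` standing in for our backend's SDPA operator (the tensor sizes and the `.contiguous()` calls are illustrative only; the real kernel and its stride requirements may differ):

```python
# Minimal sketch of the move_dim() round-trip around an SDPA-style kernel.
# torch.nn.functional.scaled_dot_product_attention is a stand-in for the
# backend's own attention operator; sizes are illustrative.
import torch
import torch.nn.functional as F

num_token, num_head, head_dim = 1024, 8, 128

# Layout handed over by vLLM: [num_token, num_head, head_dim]
q = torch.randn(num_token, num_head, head_dim)
k = torch.randn(num_token, num_head, head_dim)
v = torch.randn(num_token, num_head, head_dim)

# move_dim before the kernel: [num_head, num_token, head_dim].
# movedim itself only rewrites strides; the .contiguous() models a kernel
# that needs densely packed memory and therefore forces a real copy,
# which is where the extra cost shows up.
q_t = torch.movedim(q, 1, 0).contiguous()
k_t = torch.movedim(k, 1, 0).contiguous()
v_t = torch.movedim(v, 1, 0).contiguous()

# num_head acts as the batch dimension for SDPA here.
out_t = F.scaled_dot_product_attention(q_t, k_t, v_t)  # [num_head, num_token, head_dim]

# move_dim back so the rest of the model code sees vLLM's original layout.
out = torch.movedim(out_t, 0, 1).contiguous()           # [num_token, num_head, head_dim]
```

If the kernel can tolerate non-contiguous strides, the `movedim` calls are essentially free views and only the copies forced by `.contiguous()` cost anything, which is why fusing the layout change into the operator (or keeping the kernel's preferred layout end to end) is attractive.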
Replies: 1 comment

- Regarding the excessively long running time of the `profile_run()` stage, my solution is to enable `prefill_chunk` by default. This option is enabled when the backend is CUDA and the model is run with a huge `model_seq_len`. The default `model_seq_len` of Llama 3.3-70B is around 130,000, which is far too large; in practice, the time spent on the device side is minimal and can be ignored.
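For reference, a rough sketch of what this looks like from the vLLM user side, assuming a recent release where `enable_chunked_prefill`, `max_num_batched_tokens`, and `max_model_len` are exposed as engine arguments (the model name and numbers below are illustrative only):

```python
# Rough user-side sketch: enable chunked prefill and cap the context length
# so profile_run() builds a much smaller dummy batch than the ~130k-token
# default. Values are illustrative, not tuned recommendations.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_chunked_prefill=True,   # split long prefills into smaller chunks
    max_num_batched_tokens=8192,   # upper bound on tokens per scheduler step
    max_model_len=8192,            # cap the ~130k default context length
    tensor_parallel_size=8,
)
```

With the context length capped (or prefill chunked), the dummy sequences built during profiling are far shorter, so the attention calls inside `profile_run()` no longer dominate startup time.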