Hi there! I'm new to vLLM and have been studying it for a while, especially the LoRA part. Let me explain what I have figured out first.

My understandings about the vllm code

In the main function of multilora_inference.py, the code flow goes into vllm/worker/model_runner.py and enters line 1656 (Code Block 1).

Code Block 1, vllm/worker/model_runner.py:
class ModelRunner(...):
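(Roughly, the example drives the engine like this; this is a condensed sketch from memory with a placeholder model name and adapter path, not the exact example code or the code block above. Each engine.step() call is what ends up in ModelRunner.execute_model(), i.e. Code Block 1.)

```python
# Condensed sketch of the request-driving loop in examples/multilora_inference.py.
# The base model name and lora_path are placeholders, not the real example values.
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.lora.request import LoRARequest

engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    max_loras=1,
    max_lora_rank=8,
)
engine = LLMEngine.from_engine_args(engine_args)

lora_path = "/path/to/sql-lora"  # placeholder adapter directory
prompts = [
    ("Write an SQL query that selects all users.",
     SamplingParams(temperature=0.0, max_tokens=128),
     LoRARequest("sql-lora", 1, lora_path)),
]

request_id = 0
while prompts or engine.has_unfinished_requests():
    if prompts:
        prompt, params, lora_request = prompts.pop(0)
        engine.add_request(str(request_id), prompt, params, lora_request=lora_request)
        request_id += 1
    # Each engine.step() runs one scheduler iteration and eventually calls
    # ModelRunner.execute_model() (Code Block 1).
    for output in engine.step():
        if output.finished:
            print(output)
```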
This leads me to the forward function (line 544) in vllm/model_executor/models/llama.py (Code Block 2).

Code Block 2, vllm/model_executor/models/llama.py:
class LlamaForCausalLM(...):
If we dive into self.model(...), we find LlamaDecoderLayer.forward(), which calls LlamaAttention.forward() for Llama's attention calculation and eventually calls qkv_proj via MergedQKVParallelLinearWithLora.apply() in vllm/lora/layers.py. Here we can see the LoRA calculation part of this function, self.punica_wrapper.add_lora_packed_nslice(), and inside that function are the shrink and expand methods for the LoRA arithmetic.

Code Block 3, vllm/lora/punica.py:
class PunicaWrapper:
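(As far as I can tell, the shrink/expand split is just the low-rank LoRA matmul done in two stages: shrink projects the input down to the adapter rank, and expand projects it back up and adds it onto the corresponding slice of the base qkv_proj output. In plain PyTorch the arithmetic looks roughly like this; simplified to a single adapter, not the real Punica kernels:)

```python
# Plain-PyTorch sketch of the shrink/expand LoRA arithmetic for a packed QKV
# projection. The real punica kernels do this batched over tokens that may each
# use a different adapter; shapes here are simplified to one adapter.
import torch

def add_lora_packed_nslice_sketch(y, x, lora_a_slices, lora_b_slices,
                                  output_sizes, scale=1.0):
    """y: (num_tokens, q_size + k_size + v_size)  -- base qkv_proj output
       x: (num_tokens, hidden_size)               -- layer input
       lora_a_slices[i]: (rank, hidden_size)      -- LoRA "A" for slice i
       lora_b_slices[i]: (slice_size, rank)       -- LoRA "B" for slice i
    """
    offset = 0
    for lora_a, lora_b, size in zip(lora_a_slices, lora_b_slices, output_sizes):
        buffer = x @ lora_a.t()                                       # "shrink" to rank r
        y[:, offset:offset + size] += scale * (buffer @ lora_b.t())  # "expand" back up
        offset += size
    return y
```

Doing it in two stages keeps both matmuls small, since the rank r is much smaller than the hidden size.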
I guess that this code flow happens in the engine initializing stage, where the engine checks how much memory the system has and allocates GPU memory blocks for paged attention.

Question

My question starts from here. After initializing the engine, the multilora_inference.py example starts to serve requests: it creates test prompts (requests) and processes them. But from the second step onward (which means the start of the decode stage), Code Block 1 does not lead me to Code Block 2; instead it goes to Code Block 4.

Code Block 4, vllm/worker/model_runner.py:
class CudaGraphRunner(...):
From my investigation, Code Block 4 never reaches the punica.py file. And not only punica.py: this code never reaches the llama.py file either. I also dug into the code outside Code Block 1, like the model's sample function (Code Block 5).

Code Block 5, vllm/worker/model_runner.py:
class ModelRunner(...):
However, these other functions, including model.sample, have no relationship with llama.py either, so now I am very confused about the decoding stage. Is the decoding stage actually running? Yet if I check the results of multilora_inference.py, it does show newly decoded tokens, like this: RequestOutput(request_id=1, ... If I read the generated tokens, they look like the answer to the prompt (an SQL query starting with the keyword SELECT), so I do think decoding is working correctly. But I cannot understand how this output is created without passing through the llama.py module. What is wrong with my understanding? Thanks!
Replies: 1 comment 1 reply
IIUC, your confusion mainly stems from CUDA graphs (the cudagraph capture/replay path). I suggest you first learn about how CUDA graphs work. Alternatively, you can try changing the code below and debugging it.
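To illustrate, CudaGraphRunner is built around PyTorch's CUDA graph capture-and-replay pattern. A minimal stand-alone toy version of that pattern (just an illustration, not vLLM's actual CUDAGraphRunner code):

```python
import torch

assert torch.cuda.is_available()

model = torch.nn.Linear(16, 16).cuda().eval()
static_input = torch.zeros(8, 16, device="cuda")

with torch.no_grad():
    # Warm-up on a side stream (required before CUDA graph capture).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass; the launched GPU kernels are recorded.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

# "Decode"-like steps: copy new data into the fixed input buffer and replay.
# model.forward (the Python code) is never called again here; only the
# recorded kernels run, so a breakpoint inside forward would not fire.
for step in range(3):
    static_input.copy_(torch.randn(8, 16, device="cuda"))
    g.replay()
    print(step, static_output.sum().item())
```

After capture, g.replay() only re-runs the recorded GPU kernels; the Python forward code in llama.py, and the punica wrapper calls inside it, are not re-entered, which is why breakpoints there never fire during decode even though new tokens keep being generated.

The exact snippet referred to above as "the code below" is not preserved in this copy of the thread. One illustrative change (an assumption, not necessarily the original suggestion) is to disable CUDA graph capture with enforce_eager=True, so every decode step takes the eager Python path and can be stepped through in a debugger:

```python
# Assumed example (the original snippet is not preserved): run the engine in
# eager mode so decode steps go through the Python forward path in llama.py
# instead of replaying a captured CUDA graph.
from vllm import EngineArgs, LLMEngine

engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    enforce_eager=True,                # disable CUDA graph capture/replay
)
engine = LLMEngine.from_engine_args(engine_args)
# With enforce_eager=True, breakpoints in vllm/model_executor/models/llama.py
# and vllm/lora/punica.py should be hit on every decode step as well.
```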