[Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design #11672
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@LiuXiaoxuanPKU @sroy745 Could you check this PR, please?
LGTM.
Thanks for the fix.
@llsj14 I am wondering if there is an easy way to find the acceptance rate of the reference GitHub implementation for the MT-Bench dataset?
I need some advice on applying the MT-Bench dataset, as I'm unsure how to properly use the prompts, especially those with
For benchmarking, I used llmperf, and below is the code snippet showing how I processed the MT-Bench prompts:

```python
prompt_str = traindata[index]['conversation_a'][0]['content']  # MT-Bench
# prompt_str = traindata[index]['text']  # for the C4 dataset
encoded_token = tokenizer.encode(prompt_str)
token_length = len(encoded_token)
if token_length >= mean_input_tokens:
    index += 1
    encoded_token = encoded_token[:mean_input_tokens]
    prompt_str = tokenizer.decode(encoded_token)
    prompt = (prompt_str, get_token_length(prompt_str))
    break
```

To compare my data with the reference, my colleague checked the acceptance rate using the EAGLE implementation and found that it has a similar acceptance rate to the C4 dataset with k=1 settings. Do you have any feedback on this situation?
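For context, a minimal, self-contained version of such a sampling loop could look like the sketch below. The dataset name, tokenizer, `mean_input_tokens` value, and the number of collected prompts are assumptions for illustration, not the exact llmperf configuration used above, and short prompts are simply skipped rather than handled inside the length check.

```python
# Minimal sketch (assumptions: dataset, tokenizer, and token budget are illustrative,
# not the exact llmperf setup above): collect MT-Bench first-turn prompts truncated
# to a fixed token budget.
from datasets import load_dataset
from transformers import AutoTokenizer

mean_input_tokens = 256   # assumed target prompt length
num_prompts = 16          # assumed number of prompts to collect
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # any tokenizer works here
traindata = load_dataset("lmsys/mt_bench_human_judgments", split="human")

prompts = []
index = 0
while index < len(traindata) and len(prompts) < num_prompts:
    # First user turn of the conversation, as in the snippet above.
    prompt_str = traindata[index]["conversation_a"][0]["content"]
    encoded = tokenizer.encode(prompt_str)
    if len(encoded) >= mean_input_tokens:
        # Long enough: truncate to the target budget and keep it.
        prompt_str = tokenizer.decode(encoded[:mean_input_tokens])
        prompts.append((prompt_str, mean_input_tokens))
    index += 1

print(f"collected {len(prompts)} prompts")
```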
Thanks for the update. Good to know the acceptance rate for C4 is similar. One thing: would it be possible to run the vLLM benchmark with the ShareGPT dataset, with and without this PR, and see the improvement in TPOT?
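One rough way to get that number offline is sketched below. The model names, the `num_speculative_tokens` value, and the crude TPOT estimate (total generation time divided by total output tokens) are assumptions for illustration; the speculative-decoding arguments reflect the vLLM version around the time of this PR and may differ in newer releases. The serving benchmark with ShareGPT would give the more precise per-request TPOT.

```python
# Sketch: rough TPOT comparison with and without the EAGLE draft model.
# Model names, num_speculative_tokens, and the prompt set are illustrative
# assumptions; run once with --eagle and once without, then compare.
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--eagle", action="store_true", help="enable the EAGLE draft model")
args = parser.parse_args()

engine_args = dict(model="meta-llama/Llama-2-7b-chat-hf")
if args.eagle:
    # Speculative-decoding arguments as of the vLLM version around this PR.
    engine_args.update(
        speculative_model="yuhuili/EAGLE-llama2-chat-7B",
        num_speculative_tokens=1,
    )

llm = LLM(**engine_args)
prompts = ["Explain speculative decoding in one paragraph."] * 8  # replace with ShareGPT prompts
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"rough TPOT: {elapsed / n_tokens * 1e3:.3f} ms/token over {n_tokens} output tokens")
```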
Applied the ready label to run all tests while we wait for reviews from others.
Hi, the EAGLE GitHub repository customizes the original MT-Bench evaluation file (available here) to implement its evaluation process. You can find the customized MT-Bench evaluation file in the EAGLE repository here. As for the acceptance rate, the EAGLE implementation currently does not support calculating it.
Sure, I'll update the results. It might be helpful to understand the benchmark settings @LiuXiaoxuanPKU used for the ShareGPT dataset in #9565, such as the input/output length settings and the process used to extract prompts from the dataset, so the results can be analyzed more accurately.
I just ran my experiment with the ShareGPT dataset. The process of extracting prompts may differ from the experiment by @LiuXiaoxuanPKU in issue #9565.

Experiment settings

The part to extract prompts:

```python
traindata = load_dataset(
    "json",
    data_files="https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json",
    split='train')

conversations = traindata[index].get('conversations', [])  # sharegpt
if conversations:
    prompt_str = conversations[0].get('value', 'default_value')  # sharegpt
else:
    index += 1
    continue
```

Experiment results
Maybe you could use nn.Identity in place of DummyInputLayerNorm.
Thank you for the recommendation, @Lin-Qingyang-Alec. I didn't know about nn.Identity, which behaves exactly like DummyInputLayerNorm in my code. However, I think it's not a bad idea to keep both DummyInputLayerNorm and DummyOutputNorm. I was looking for an equivalent to DummyOutputNorm that performs the residual operation, but I couldn't find a similar one in the nn library.
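To illustrate the point, here is a rough sketch of what such placeholder modules could look like (an illustration of the idea, not necessarily the exact vLLM implementation): the input norm can be a pure pass-through, equivalent to nn.Identity, while the output norm must still accept and merge the residual, which is why no off-the-shelf nn module fits.

```python
from typing import Optional

import torch
from torch import nn


class DummyInputLayerNorm(nn.Module):
    """Pass-through replacing the first decoder layer's input LayerNorm."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x  # behaves the same as nn.Identity()


class DummyOutputNorm(nn.Module):
    """Placeholder for the final norm: merges the residual instead of normalizing."""

    def forward(self, x: torch.Tensor, residual: Optional[torch.Tensor]):
        if residual is None:
            return x
        # The (hidden_states, residual) calling convention is what rules out nn.Identity here.
        return x + residual, None
```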
Summary
#9565
#11126
Experiment
Below are the experimental results from the above trials.
Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B with K=1
Additional Experiment
Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B with different K
Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B
Meta-Llama-3-8B-Instruct / EAGLE-LLaMA3-Instruct-8B