[Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design #11672

Open · wants to merge 7 commits into base: main
Conversation

@llsj14 (Contributor) commented Jan 1, 2025

Summary

#9565
#11126

  1. disable norm: Removed the input layer normalization and the output normalization.
  2. add residual: Added a residual path at the end of the Llama model. This addition is needed because the final residual is normally applied as part of the output normalization, which was disabled in the first step. (A simplified sketch of both changes follows this list.)
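
To make this concrete, here is a simplified sketch of the two dummy modules involved. The names DummyInputLayerNorm and DummyOutputNorm are the ones used in the discussion below; the exact wiring in vllm/model_executor/models/eagle.py may differ from this sketch.

import torch.nn as nn


class DummyInputLayerNorm(nn.Module):
    # Identity stand-in for the removed input layer normalization.
    def forward(self, x):
        return x


class DummyOutputNorm(nn.Module):
    # Stand-in for the removed output normalization: instead of normalizing,
    # it adds the pending residual back so the residual path is preserved.
    def forward(self, x, residual):
        if residual is None:
            return x
        return x + residual, None


# The draft Llama model is then patched roughly along these lines
# (attribute paths are illustrative, not the exact diff):
#   self.model.model.layers[0].input_layernorm = DummyInputLayerNorm()
#   self.model.model.norm = DummyOutputNorm()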

Experiment

Below are the experimental results for the changes above.

  • model: Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B with K=1
  • dataset: MT-Bench
  • input/output length: 128/128
  • sampling setting:
    • multinomial: top_k=-1, top_p=1.0, temp=1.0
    • greedy: top_k=1, top_p=1.0, temp=1.0
| Approach | Accept Rate (Multinomial) | Accept Rate (Greedy) |
| --- | --- | --- |
| as-is | 0.131 | 0.308 |
| 1 (disable norm) | 0.391 | 0.529 |
| 1+2 (disable norm + add residual) | 0.565 | 0.619 |

Additional Experiment

  • model: Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B with different K
  • dataset: c4 / MT-Bench
  • input/output length: 1024/128 (c4), 128/128 (MT-Bench)
  • sampling setting:
    • multinomial: top_k=-1, top_p=1.0, temp=1.0
    • greedy: top_k=1, top_p=1.0, temp=1.0

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B

Average accept rate:

| K | Approach | c4 (Multinomial) | c4 (Greedy) | MT-Bench (Multinomial) | MT-Bench (Greedy) |
| --- | --- | --- | --- | --- | --- |
| K=1 | as-is | 0.103 | 0.302 | 0.136 | 0.302 |
| K=1 | to-be | 0.393 | 0.508 | 0.58 | 0.607 |
| K=2 | as-is | 0.101 | 0.209 | 0.112 | 0.229 |
| K=2 | to-be | 0.329 | 0.416 | 0.486 | 0.543 |
| K=3 | as-is | 0.111 | 0.167 | 0.117 | 0.192 |
| K=3 | to-be | 0.306 | 0.36 | 0.437 | 0.495 |

Meta-Llama-3-8B-Instruct / EAGLE-LLaMA3-Instruct-8B

Average accept rate:

| K | Approach | c4 (Multinomial) | c4 (Greedy) | MT-Bench (Multinomial) | MT-Bench (Greedy) |
| --- | --- | --- | --- | --- | --- |
| K=1 | as-is | 0.115 | 0.441 | 0.156 | 0.561 |
| K=1 | to-be | 0.412 | 0.473 | 0.56 | 0.593 |
| K=2 | as-is | 0.121 | 0.286 | 0.143 | 0.376 |
| K=2 | to-be | 0.339 | 0.365 | 0.449 | 0.468 |
| K=3 | as-is | 0.138 | 0.236 | 0.152 | 0.301 |
| K=3 | to-be | 0.304 | 0.307 | 0.399 | 0.414 |

github-actions bot commented Jan 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@llsj14 (Contributor, Author) commented Jan 2, 2025

@LiuXiaoxuanPKU @sroy745 Could you check this PR please?

@LiuXiaoxuanPKU self-assigned this Jan 2, 2025
@sroy745 (Collaborator) left a comment

LGTM.
Thanks for the fix.
@llsj14 I am wondering if there is an easy way to find the acceptance rate of the reference GitHub implementation for the MT-Bench dataset?

Review comment on vllm/model_executor/models/eagle.py (outdated, resolved)
Signed-off-by: Sungjae Lee <[email protected]>
@llsj14 (Contributor, Author) commented Jan 3, 2025

@llsj14 I am wondering if there is an easy way to find the acceptance rate of the reference GitHub implementation for the MT-Bench dataset?

I need some advice on applying the MT-Bench dataset, as I'm unsure how to properly use the prompts, especially those with conversation_a and conversation_b in MT-Bench. For my experiment, I only used the content of conversation_a and truncated it to an input length of 128. I suspect this might be causing discrepancies between my results and those reported in the paper (though I also have limited information about how the paper used the MT-Bench dataset).

For benchmarking, I used llmperf, and below is the code snippet showing how I processed the MT-Bench prompts:

# Inside the prompt-sampling loop of llmperf; traindata, tokenizer, index,
# mean_input_tokens, and get_token_length come from the surrounding code.
prompt_str = traindata[index]['conversation_a'][0]['content']  # MT-Bench
# prompt_str = traindata[index]['text']  # for the C4 dataset
encoded_token = tokenizer.encode(prompt_str)
token_length = len(encoded_token)

if token_length >= mean_input_tokens:
    # Keep only prompts that are long enough, truncated to the target input length.
    index += 1
    encoded_token = encoded_token[:mean_input_tokens]
    prompt_str = tokenizer.decode(encoded_token)
    prompt = (prompt_str, get_token_length(prompt_str))
    break

To compare my data with the reference, my colleague checked the acceptance rate using the EAGLE implementation and found that it is similar to the acceptance rate on the C4 dataset with the K=1 setting.

Do you have any feedback on this situation?

llsj14 added 2 commits January 3, 2025 06:42
Signed-off-by: Sungjae Lee <[email protected]>
Signed-off-by: Sungjae Lee <[email protected]>
@sroy745 (Collaborator) commented Jan 3, 2025

To compare my data with the reference, my colleague checked the acceptance rate using the EAGLE implementation and found that it is similar to the acceptance rate on the C4 dataset with the K=1 setting.

Thanks for the update. Good to know the acceptance rate for C4 is similar. One thing: would it be possible to run the vLLM benchmark with the ShareGPT dataset with and without the PR and see the improvement in TPOT?

@sroy745 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jan 3, 2025
@sroy745 (Collaborator) commented Jan 3, 2025

Applied the ready label to run all tests while we wait for reviews from others.

@jeongin601 (Contributor) commented:
@llsj14 I am wondering if there is an easy way to find the acceptance rate of the reference GitHub implementation for the MT-Bench dataset?

Hi,

The EAGLE GitHub repository customizes the original MT-Bench evaluation file (available here) to implement its evaluation process. You can find the customized MT-Bench evaluation file in the EAGLE repository here.

As for the acceptance rate, the EAGLE implementation currently does not provide support for calculating it.

@llsj14 (Contributor, Author) commented Jan 3, 2025

Thanks for the update. Good to know the acceptance rate for C4 is similar. One thing: would it be possible to run the vLLM benchmark with the ShareGPT dataset with and without the PR and see the improvement in TPOT?

Sure, I'll update the results. It might also be helpful to know the benchmark settings @LiuXiaoxuanPKU used for the ShareGPT dataset in #9565, such as the input/output length settings and the process used to extract prompts from the dataset, so you can better analyze the results.

@llsj14 (Contributor, Author) commented Jan 3, 2025

I just ran my experiment with the ShareGPT dataset. The process of extracting prompts may differ from that of @LiuXiaoxuanPKU's experiment in issue #9565.

Experiment settings

  • model: Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B
  • device: A100
  • sampling parameter: multinomial (top_k=-1, top_p=1.0, temperature=1.0)
  • dataset: ShareGPT
  • average input length: 318 tokens
  • average output length: 128 tokens
  • the number of requests: 500

The code used to extract prompts:

from datasets import load_dataset

traindata = load_dataset(
    "json",
    data_files="https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json",
    split='train')

# Inside the prompt-sampling loop: use the first turn of each conversation.
conversations = traindata[index].get('conversations', [])  # ShareGPT
if conversations:
    prompt_str = conversations[0].get('value', 'default_value')  # ShareGPT
else:
    index += 1
    continue
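
For context, here is a minimal sketch of how an equivalent EAGLE speculative-decoding run could be configured with vLLM's offline API. The results below were actually measured with llmperf against a running server; the draft-model repo id and the speculative_model/num_speculative_tokens arguments in this sketch are assumptions for illustration, not the exact benchmark setup.

from vllm import LLM, SamplingParams

# Hypothetical reproduction sketch; the numbers in this thread were collected
# with llmperf against a running server, not with this offline snippet.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="yuhuili/EAGLE-llama2-chat-7B",  # assumed HF repo id
    num_speculative_tokens=1,                          # K in the tables below
)

# Multinomial sampling setting used in the experiments.
params = SamplingParams(temperature=1.0, top_p=1.0, top_k=-1, max_tokens=128)

outputs = llm.generate(["Summarize the plot of Hamlet in two sentences."], params)
print(outputs[0].outputs[0].text)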

Experiment results

  • The results show improvements in acceptance rate and speed-up compared with the results without this PR.
  • However, every configuration is still slower than the vanilla case, which does not use speculative decoding.
| K | Approach | Accept rate | System efficiency | Inter-token latency (mean) [s] | Total elapsed time [s] | Speed-up (vs vanilla) | Speed-up (vs as-is) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| - | vanilla | - | - | 0.02 | 53.64 | - | - |
| K=1 | as-is | 0.112 | 0.556 | 0.036 | 81.83 | 0.66 | - |
| K=1 | to-be | 0.56 | 0.78 | 0.029 | 66.58 | 0.81 | 1.23 |
| K=2 | as-is | 0.106 | 0.374 | 0.043 | 95.59 | 0.56 | - |
| K=2 | to-be | 0.476 | 0.602 | 0.03 | 68.45 | 0.78 | 1.40 |
| K=3 | as-is | 0.115 | 0.28 | 0.047 | 107.01 | 0.50 | - |
| K=3 | to-be | 0.43 | 0.477 | 0.033 | 74.35 | 0.72 | 1.44 |

@Lin-Qingyang-Alec commented:
Maybe you could use nn.Identity in place of DummyInputLayerNorm.

@llsj14 (Contributor, Author) commented Jan 6, 2025

Maybe you could use nn.Identity in place of DummyInputLayerNorm.

Thank you for the recommendation, @Lin-Qingyang-Alec.

I didn't know about nn.Identity; it behaves exactly like DummyInputLayerNorm in my code. Still, I think it's not a bad idea to keep both DummyInputLayerNorm and DummyOutputNorm. I was looking for an equivalent to DummyOutputNorm in the nn library, i.e. one that also performs the residual addition, but I couldn't find one. If I need to change it, I'd prefer to modify both at the same time.
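
To illustrate the distinction (simplified; in the actual Llama forward pass the output-norm slot is invoked with both the hidden states and the running residual):

import torch
import torch.nn as nn

x = torch.randn(2, 8)
residual = torch.randn(2, 8)

# nn.Identity is a drop-in for the input-norm slot: one tensor in, same tensor out.
assert torch.equal(nn.Identity()(x), x)

# The output-norm slot is called with (hidden_states, residual) and also has to
# perform the residual addition, which a plain identity cannot do.
try:
    nn.Identity()(x, residual)
except TypeError:
    # nn.Identity.forward only accepts a single input tensor.
    pass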

Labels
ready ONLY add when PR is ready to merge/full CI is needed