[Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design #11672

Open · wants to merge 7 commits into base: main
Conversation

@llsj14 (Contributor) commented Jan 1, 2025

Summary

#9565
#11126

  1. disable norm: Removed the input layer normalization and the output normalization.
  2. add residual: Added a residual path at the end of the Llama model. This addition is needed because the final residual is normally applied as part of the output normalization, which was disabled in the first step. (A simplified sketch of both changes follows this list.)
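
To make this concrete, here is a simplified sketch of the two dummy modules involved. The names DummyInputLayerNorm and DummyOutputNorm are the ones used in the discussion below; the exact wiring in vllm/model_executor/models/eagle.py may differ from this sketch.

import torch.nn as nn


class DummyInputLayerNorm(nn.Module):
    # Identity stand-in for the removed input layer normalization.
    def forward(self, x):
        return x


class DummyOutputNorm(nn.Module):
    # Stand-in for the removed output normalization: instead of normalizing,
    # it adds the pending residual back so the residual path is preserved.
    def forward(self, x, residual):
        if residual is None:
            return x
        return x + residual, None


# The draft Llama model is then patched roughly along these lines
# (attribute paths are illustrative, not the exact diff):
#   self.model.model.layers[0].input_layernorm = DummyInputLayerNorm()
#   self.model.model.norm = DummyOutputNorm()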

Experiment

Below are the experimental results for the changes above.

  • model: Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B with K=1
  • dataset: MT-Bench
  • input/output length: 128/128
  • sampling setting:
    • multinomial: top_k=-1, top_p=1.0, temp=1.0
    • greedy: top_k=1, top_p=1.0, temp=1.0
| Approach | Accept Rate (Multinomial) | Accept Rate (Greedy) |
| --- | --- | --- |
| as-is | 0.131 | 0.308 |
| 1 (disable norm) | 0.391 | 0.529 |
| 1+2 (disable norm + add residual) | 0.565 | 0.619 |

Additional Experiment

  • model: Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B with different K
  • dataset: c4 / MT-Bench
  • input/output length: 1024/128 (c4), 128/128 (MT-Bench)
  • sampling setting:
    • multinomial: top_k=-1, top_p=1.0, temp=1.0
    • greedy: top_k=1, top_p=1.0, temp=1.0

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B

Average accept rate:

| K | Approach | c4 (Multinomial) | c4 (Greedy) | MT-Bench (Multinomial) | MT-Bench (Greedy) |
| --- | --- | --- | --- | --- | --- |
| K=1 | as-is | 0.103 | 0.302 | 0.136 | 0.302 |
| K=1 | to-be | 0.393 | 0.508 | 0.58 | 0.607 |
| K=2 | as-is | 0.101 | 0.209 | 0.112 | 0.229 |
| K=2 | to-be | 0.329 | 0.416 | 0.486 | 0.543 |
| K=3 | as-is | 0.111 | 0.167 | 0.117 | 0.192 |
| K=3 | to-be | 0.306 | 0.36 | 0.437 | 0.495 |

Meta-Llama-3-8B-Instruct / EAGLE-LLaMA3-Instruct-8B

Average accept rate:

| K | Approach | c4 (Multinomial) | c4 (Greedy) | MT-Bench (Multinomial) | MT-Bench (Greedy) |
| --- | --- | --- | --- | --- | --- |
| K=1 | as-is | 0.115 | 0.441 | 0.156 | 0.561 |
| K=1 | to-be | 0.412 | 0.473 | 0.56 | 0.593 |
| K=2 | as-is | 0.121 | 0.286 | 0.143 | 0.376 |
| K=2 | to-be | 0.339 | 0.365 | 0.449 | 0.468 |
| K=3 | as-is | 0.138 | 0.236 | 0.152 | 0.301 |
| K=3 | to-be | 0.304 | 0.307 | 0.399 | 0.414 |

github-actions bot commented Jan 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@llsj14 (Contributor, Author) commented Jan 2, 2025

@LiuXiaoxuanPKU @sroy745 Could you check this PR please?

@LiuXiaoxuanPKU self-assigned this Jan 2, 2025
@sroy745 (Collaborator) left a comment

LGTM.
Thanks for the fix.
@llsj14 I am wondering if there is an easy way to find the acceptance rate of the reference GitHub implementation for the MT-Bench dataset?

Review comment on vllm/model_executor/models/eagle.py (outdated, resolved)
Signed-off-by: Sungjae Lee <[email protected]>
@llsj14 (Contributor, Author) commented Jan 3, 2025

@llsj14 I am wondering if there is an easy way to find the acceptance rate of the reference GitHub implementation for the MT-Bench dataset?

I need some advice on applying the MT-Bench dataset, as I'm unsure how to properly use the prompts, especially those with conversation_a and conversation_b in MT-Bench. For my experiment, I only used the content of conversation_a and truncated it to an input length of 128. I suspect this might be causing discrepancies between my results and those reported in the paper (though I also have limited information about how the paper used the MT-Bench dataset).

For benchmarking, I used llmperf, and below is the code snippet showing how I processed the MT-Bench prompts:

# Inside the prompt-sampling loop of llmperf; traindata, tokenizer, index,
# mean_input_tokens, and get_token_length come from the surrounding code.
prompt_str = traindata[index]['conversation_a'][0]['content']  # MT-Bench
# prompt_str = traindata[index]['text']  # for the C4 dataset
encoded_token = tokenizer.encode(prompt_str)
token_length = len(encoded_token)

if token_length >= mean_input_tokens:
    # Keep only prompts that are long enough, truncated to the target input length.
    index += 1
    encoded_token = encoded_token[:mean_input_tokens]
    prompt_str = tokenizer.decode(encoded_token)
    prompt = (prompt_str, get_token_length(prompt_str))
    break

To compare my data with the reference, my colleague checked the acceptance rate using the EAGLE implementation and found that it is similar to the acceptance rate on the C4 dataset with the K=1 setting.

Do you have any feedback on this situation?

llsj14 added 2 commits January 3, 2025 06:42
Signed-off-by: Sungjae Lee <[email protected]>
Signed-off-by: Sungjae Lee <[email protected]>
@sroy745 (Collaborator) commented Jan 3, 2025

To compare my data with the reference, my colleague checked the acceptance rate using the EAGLE implementation and found that it is similar to the acceptance rate on the C4 dataset with the K=1 setting.

Thanks for the update. Good to know the acceptance rate for C4 is similar. One thing: would it be possible to run the vLLM benchmark with the ShareGPT dataset with and without the PR and see the improvement in TPOT?

@sroy745 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jan 3, 2025
@sroy745 (Collaborator) commented Jan 3, 2025

Applied the ready label to run all tests while we wait for reviews from others.

@jeongin601 (Contributor) commented:
@llsj14 I am wondering if there is an easy way to find the acceptance rate of the reference GitHub implementation for the MT-Bench dataset?

Hi,

The EAGLE GitHub repository customizes the original MT-Bench evaluation file (available here) to implement its evaluation process. You can find the customized MT-Bench evaluation file in the EAGLE repository here.

As for the acceptance rate, the EAGLE implementation currently does not provide support for calculating it.

@llsj14 (Contributor, Author) commented Jan 3, 2025

Thanks for the update. Good to know the acceptance rate for C4 is similar. One thing: would it be possible to run the vLLM benchmark with the ShareGPT dataset with and without the PR and see the improvement in TPOT?

Sure, I'll update the results. It might also be helpful to know the benchmark settings @LiuXiaoxuanPKU used for the ShareGPT dataset in #9565, such as the input/output length settings and the process used to extract prompts from the dataset, so you can better analyze the results.

@llsj14 (Contributor, Author) commented Jan 3, 2025

I just ran my experiment with the ShareGPT dataset. The process of extracting prompts may differ from that of @LiuXiaoxuanPKU's experiment in issue #9565.

Experiment settings

  • model: Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B
  • device: A100
  • sampling parameter: multinomial (top_k=-1, top_p=1.0, temperature=1.0)
  • dataset: ShareGPT
  • average input length: 318 tokens
  • average output length: 128 tokens
  • the number of requests: 500

The code used to extract prompts:

from datasets import load_dataset

traindata = load_dataset(
    "json",
    data_files="https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json",
    split='train')

# Inside the prompt-sampling loop: use the first turn of each conversation.
conversations = traindata[index].get('conversations', [])  # ShareGPT
if conversations:
    prompt_str = conversations[0].get('value', 'default_value')  # ShareGPT
else:
    index += 1
    continue
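
For context, here is a minimal sketch of how an equivalent EAGLE speculative-decoding run could be configured with vLLM's offline API. The results below were actually measured with llmperf against a running server; the draft-model repo id and the speculative_model/num_speculative_tokens arguments in this sketch are assumptions for illustration, not the exact benchmark setup.

from vllm import LLM, SamplingParams

# Hypothetical reproduction sketch; the numbers in this thread were collected
# with llmperf against a running server, not with this offline snippet.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="yuhuili/EAGLE-llama2-chat-7B",  # assumed HF repo id
    num_speculative_tokens=1,                          # K in the tables below
)

# Multinomial sampling setting used in the experiments.
params = SamplingParams(temperature=1.0, top_p=1.0, top_k=-1, max_tokens=128)

outputs = llm.generate(["Summarize the plot of Hamlet in two sentences."], params)
print(outputs[0].outputs[0].text)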

Experiment results

  • The results show improvements in acceptance rate and speed-up compared with the results without this PR.
  • However, every configuration is still slower than the vanilla case, which does not use speculative decoding.
| K | Approach | Accept rate | System efficiency | Inter-token latency (mean) [s] | Total elapsed time [s] | Speed-up (vs vanilla) | Speed-up (vs as-is) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| - | vanilla | - | - | 0.02 | 53.64 | - | - |
| K=1 | as-is | 0.112 | 0.556 | 0.036 | 81.83 | 0.66 | - |
| K=1 | to-be | 0.56 | 0.78 | 0.029 | 66.58 | 0.81 | 1.23 |
| K=2 | as-is | 0.106 | 0.374 | 0.043 | 95.59 | 0.56 | - |
| K=2 | to-be | 0.476 | 0.602 | 0.03 | 68.45 | 0.78 | 1.40 |
| K=3 | as-is | 0.115 | 0.28 | 0.047 | 107.01 | 0.50 | - |
| K=3 | to-be | 0.43 | 0.477 | 0.033 | 74.35 | 0.72 | 1.44 |

@Lin-Qingyang-Alec commented:
Maybe you could use nn.Identity in place of DummyInputLayerNorm.

@llsj14 (Contributor, Author) commented Jan 6, 2025

Maybe you could use nn.Identity in place of DummyInputLayerNorm.

Thank you for the recommendation, @Lin-Qingyang-Alec.

I didn't know about nn.Identity; it behaves exactly like DummyInputLayerNorm in my code. Still, I think it's not a bad idea to keep both DummyInputLayerNorm and DummyOutputNorm. I was looking for an equivalent to DummyOutputNorm in the nn library, i.e. one that also performs the residual addition, but I couldn't find one. If I need to change it, I'd prefer to modify both at the same time.
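
To illustrate the distinction (simplified; in the actual Llama forward pass the output-norm slot is invoked with both the hidden states and the running residual):

import torch
import torch.nn as nn

x = torch.randn(2, 8)
residual = torch.randn(2, 8)

# nn.Identity is a drop-in for the input-norm slot: one tensor in, same tensor out.
assert torch.equal(nn.Identity()(x), x)

# The output-norm slot is called with (hidden_states, residual) and also has to
# perform the residual addition, which a plain identity cannot do.
try:
    nn.Identity()(x, residual)
except TypeError:
    # nn.Identity.forward only accepts a single input tensor.
    pass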

Labels
ready ONLY add when PR is ready to merge/full CI is needed