
How to exactly reproduce the results on the openllm leaderboard? #2583

Open
Zilinghan opened this issue Dec 19, 2024 · 6 comments

Comments

@Zilinghan

Hi,

I am trying to reproduce the results on the Hugging Face Open LLM Leaderboard, but I found some inconsistencies between the results generated by the harness and the results shown on the leaderboard. For example, for meta-llama/Meta-Llama-3.1-70B-Instruct, the average BBH acc_norm in the raw results generated by the harness is 0.6915466064919285 (see the first attached screenshot), but the leaderboard shows 55.93%. I am wondering how the 55.93% is calculated from 0.6915466064919285. Thanks!

[Screenshots: raw results file and leaderboard page]

Similarly, for MMLU-Pro, the raw result is 0.5309175531914894, but the leaderboard shows 47.88%.
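
My best guess so far is that the leaderboard rescales raw scores against a random-guess baseline before averaging; a quick sketch of that calculation (assuming a 0.1 baseline for MMLU-Pro's 10 answer options) does reproduce the 47.88%, but I'd like to confirm how the BBH number is derived:

# Sketch only: assumes the leaderboard rescales raw accuracy against the
# random-guess baseline, i.e. normalized = (raw - baseline) / (1 - baseline).
def normalize(raw: float, baseline: float) -> float:
    """Map raw accuracy to a 0-1 scale where random guessing maps to 0."""
    return max(0.0, (raw - baseline) / (1.0 - baseline))

# MMLU-Pro has 10 options, so the assumed random baseline is 0.1.
print(normalize(0.5309175531914894, 0.1))  # -> 0.4788, i.e. the 47.88% shown
# BBH subtasks have different option counts, so each subtask would presumably
# need its own baseline before averaging, which could explain why the plain
# mean acc_norm (0.6915) does not match 55.93% directly.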

@Ryuuranwlb

Ryuuranwlb commented Dec 20, 2024

TL;DR: Try adjusting your batch_size, as it can influence evaluation results due to some unknown GPU-related optimization issues (not sure if it's caused by lm-eval itself). Some users have reported that the results are consistent when using CPUs for evaluation. Additionally, this batch_size problem appears not only in leaderboard datasets but also in other evaluation contexts.

The Open LLM Leaderboard blog mentions that evaluation results can vary with different batch_size values due to padding-related issues. Even though padding has been addressed, some evaluation inconsistencies persist in certain cases. While the blog says the results may only 'vary slightly', in our experience evaluation scores with batch_size=1 and batch_size=8 can sometimes differ by as much as 10%, suggesting the issue may be more significant than described.

For further context, check related discussions:
Issue #1625
Issue #1645

@Zilinghan
Author

Hi @Ryuuranwlb, I don't think the issue is related to batch size, as the raw results I showed were generated by Hugging Face itself, and there is still a huge gap between the raw result 0.6915466 and the displayed result 55.93%.

@Ryuuranwlb

Ryuuranwlb commented Dec 22, 2024

Perhaps you could try further inspecting the settings related to apply_chat_template for Instruct models.
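
For example, a rough sketch of toggling it through the Python API (assuming lm_eval >= 0.4.3, where simple_evaluate accepts apply_chat_template and fewshot_as_multiturn; parameter names may differ by version):

# Rough sketch, not a verified leaderboard config: compare runs with and
# without the chat template for an Instruct model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct",
    tasks=["leaderboard_ifeval"],
    batch_size=1,
    apply_chat_template=True,   # flip to False to see the difference
    fewshot_as_multiturn=True,  # usually paired with apply_chat_template
)
print(results["results"]["leaderboard_ifeval"])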

Also, note that batch_size can cause significant discrepancies (I don't know why). I ran the following lm_eval CLI command to evaluate leaderboard_ifeval (without chat template) on dolphin-2.9.2-Phi-3-Medium with batch_size=1 and batch_size=8 on an H800 GPU. The results differed astonishingly.

lm_eval \
    --model hf \
    --model_args pretrained=cognitivecomputations/dolphin-2.9.2-Phi-3-Medium,trust_remote_code=True \
    --tasks leaderboard_ifeval \
    --batch_size 1

Batch size 1 result:
[Screenshot: results table]

Batch size 8 result:
[Screenshot: results table]

@Zilinghan
Author

But the problem is not an inconsistency between my own runs and Hugging Face's. It is an inconsistency between Hugging Face's raw results and the percentages shown on Hugging Face's leaderboard page. The raw results I showed are provided by Hugging Face and available here (you might need to request access to the dataset to see them).

@CandiedCode
Contributor

CandiedCode commented Dec 30, 2024

It would also help if the average were returned, so that no further calculation is needed to compare against the leaderboard.

For example, Llama reports "the average across all the scores."
[Screenshot: reported IFEval score]
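
In the meantime, a rough sketch of that extra calculation for IFEval, assuming the metric keys in the harness results JSON look like "prompt_level_strict_acc,none" (names may differ by lm-eval version):

# Sketch only: average the four IFEval accuracies from a results file written
# with --output_path; the metric key names below are assumptions.
import json

with open("results.json") as f:
    ifeval = json.load(f)["results"]["leaderboard_ifeval"]

metrics = [
    "prompt_level_strict_acc,none",
    "inst_level_strict_acc,none",
    "prompt_level_loose_acc,none",
    "inst_level_loose_acc,none",
]
print(sum(ifeval[m] for m in metrics) / len(metrics))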

@cpwan

cpwan commented Dec 31, 2024

Similarly, for the 8B model:

  1. I cannot relate the JSON reported by Hugging Face to the benchmark page.
  2. There is a discrepancy between my own runs and the results reported by Hugging Face.

Benchmark page, MMLU-Pro: 0.3068
Reported JSON, MMLU-Pro: 0.3746
My results, MMLU-Pro: 0.1629

# my results were obtained by running against vllm==0.6.3.post1 with lm_eval==0.4.7
HF_TOKEN=xxxxx OPENAI_API_KEY=xxxxx lm_eval \
    --model local-completions --tasks leaderboard \
    --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://locally_hosted_vllm/v1/completions,temperature=0 \
    --apply_chat_template --output_path "./result" --log_samples
