
How to exactly reproduce the results on the openllm leaderboard? #2583

Open
Zilinghan opened this issue Dec 19, 2024 · 6 comments

Comments

@Zilinghan

Hi,

I am trying to reproduce the results on the Hugging Face Open LLM Leaderboard, but I found some inconsistencies between the results generated by the harness and the results shown on the leaderboard. For example, for meta-llama/Meta-Llama-3.1-70B-Instruct, the average BBH acc_norm in the raw results generated by the harness is 0.6915466064919285 (see the first attached screenshot), but the leaderboard shows 55.93%. I am wondering how the 55.93% is calculated from 0.6915466064919285. Thanks!

[Screenshots: raw results file and leaderboard page]

Similarly, for MMLU-Pro, the raw result is 0.5309175531914894, but the leaderboard shows 47.88%.
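
My best guess so far is that the leaderboard rescales raw scores against a random-guess baseline before averaging; a quick sketch of that calculation (assuming a 0.1 baseline for MMLU-Pro's 10 answer options) does reproduce the 47.88%, but I'd like to confirm how the BBH number is derived:

# Sketch only: assumes the leaderboard rescales raw accuracy against the
# random-guess baseline, i.e. normalized = (raw - baseline) / (1 - baseline).
def normalize(raw: float, baseline: float) -> float:
    """Map raw accuracy to a 0-1 scale where random guessing maps to 0."""
    return max(0.0, (raw - baseline) / (1.0 - baseline))

# MMLU-Pro has 10 options, so the assumed random baseline is 0.1.
print(normalize(0.5309175531914894, 0.1))  # -> 0.4788, i.e. the 47.88% shown
# BBH subtasks have different option counts, so each subtask would presumably
# need its own baseline before averaging, which could explain why the plain
# mean acc_norm (0.6915) does not match 55.93% directly.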

@Ryuuranwlb

Ryuuranwlb commented Dec 20, 2024

TL;DR: Try adjusting your batch_size, as it can influence evaluation results due to some unknown GPU-related optimization issues (not sure if it's caused by lm-eval itself). Some users have reported that the results are consistent when using CPUs for evaluation. Additionally, this batch_size problem appears not only in leaderboard datasets but also in other evaluation contexts.

The Open LLM Leaderboard blog mentions that evaluation results can vary with different batch_size values due to padding-related issues. Even though padding has been addressed, some evaluation inconsistencies persist in certain cases. While the blog says the results may only 'vary slightly', in our experience evaluation scores with batch_size=1 and batch_size=8 can sometimes differ by as much as 10%, suggesting the issue may be more significant than described.

For further context, check related discussions:
Issue #1625
Issue #1645

@Zilinghan
Author

Hi @Ryuuranwlb, I don't think the issue is related to batch size, as the raw results I showed were generated by Hugging Face itself, and there is still a huge gap between the raw result 0.6915466 and the displayed result 55.93%.

@Ryuuranwlb

Ryuuranwlb commented Dec 22, 2024

Perhaps you could try further inspecting the settings related to apply_chat_template for Instruct models.
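
For example, a rough sketch of toggling it through the Python API (assuming lm_eval >= 0.4.3, where simple_evaluate accepts apply_chat_template and fewshot_as_multiturn; parameter names may differ by version):

# Rough sketch, not a verified leaderboard config: compare runs with and
# without the chat template for an Instruct model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct",
    tasks=["leaderboard_ifeval"],
    batch_size=1,
    apply_chat_template=True,   # flip to False to see the difference
    fewshot_as_multiturn=True,  # usually paired with apply_chat_template
)
print(results["results"]["leaderboard_ifeval"])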

Also, note that batch_size can cause significant discrepancies (I don't know why). I ran the following lm_eval CLI command to evaluate leaderboard_ifeval (without chat template) on dolphin-2.9.2-Phi-3-Medium with batch_size=1 and batch_size=8 on an H800 GPU. The results differed astonishingly.

lm_eval \
    --model hf \
    --model_args pretrained=cognitivecomputations/dolphin-2.9.2-Phi-3-Medium,trust_remote_code=True \
    --tasks leaderboard_ifeval \
    --batch_size 1

Batch size 1 result:
[Screenshot: results table]

Batch size 8 result:
[Screenshot: results table]

@Zilinghan
Author

But the problem is not an inconsistency between my own runs and Hugging Face's. It is an inconsistency between Hugging Face's raw results and the percentages shown on Hugging Face's leaderboard page. The raw results I showed are provided by Hugging Face and available here (you might need to request access to the dataset to see them).

@CandiedCode
Contributor

CandiedCode commented Dec 30, 2024

It would also help if the average were returned, so that no further calculation is needed to compare against the leaderboard.

For example, Llama reports "the average across all the scores."
[Screenshot: reported IFEval score]
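
In the meantime, a rough sketch of that extra calculation for IFEval, assuming the metric keys in the harness results JSON look like "prompt_level_strict_acc,none" (names may differ by lm-eval version):

# Sketch only: average the four IFEval accuracies from a results file written
# with --output_path; the metric key names below are assumptions.
import json

with open("results.json") as f:
    ifeval = json.load(f)["results"]["leaderboard_ifeval"]

metrics = [
    "prompt_level_strict_acc,none",
    "inst_level_strict_acc,none",
    "prompt_level_loose_acc,none",
    "inst_level_loose_acc,none",
]
print(sum(ifeval[m] for m in metrics) / len(metrics))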

@cpwan

cpwan commented Dec 31, 2024

Similarly, for the 8B model:

  1. I cannot relate the JSON reported by Hugging Face to the benchmark page.
  2. There is a discrepancy between my own runs and the results reported by Hugging Face.

Benchmark page, MMLU-Pro: 0.3068
Reported JSON, MMLU-Pro: 0.3746
My results, MMLU-Pro: 0.1629

# my results were obtained by running against vllm==0.6.3.post1 with lm_eval==0.4.7
HF_TOKEN=xxxxx OPENAI_API_KEY=xxxxx lm_eval \
    --model local-completions --tasks leaderboard \
    --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://locally_hosted_vllm/v1/completions,temperature=0 \
    --apply_chat_template --output_path "./result" --log_samples
