How to exactly reproduce the results on the openllm leaderboard? #2583
Comments
TL;DR: Try adjusting your batch_size. The OpenLLM Leaderboard blog mentions that evaluation results can vary with different batch_size values due to padding-related issues. Although padding has been addressed, some evaluation inconsistencies persist in certain cases. While the leaderboard blog says the results may "vary slightly", in our experience scores with batch_size=1 and batch_size=8 can sometimes differ by as much as 10%, suggesting the issue may be more significant than described. For further context, check the related discussions.
Hi @Ryuuranwlb, I don't think the issue is related to batch size, as the raw results I showed were generated by Hugging Face, and there is a huge gap between the raw result 0.6915466 and the displayed result 55.93%.
Perhaps you could try further inspecting the settings related to apply_chat_template for Instruct models. Also, note that significant discrepancies caused by batch_size are possible (I don't know why). I ran the following lm_eval CLI command to evaluate leaderboard_ifeval (without a chat template) on dolphin-2.9.2-Phi-3-Medium with batch_size=1 and batch_size=8 on an H800 GPU. The results differed astonishingly.
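(The exact command did not survive the copy-paste; below is a sketch of what it looked like. The checkpoint path and output directory are placeholders, and the second run only changed `--batch_size` to 8.)

```bash
# Sketch: evaluate leaderboard_ifeval with the HF backend at batch size 1;
# repeat with --batch_size 8 to compare. Paths here are placeholders.
lm_eval --model hf \
  --model_args pretrained=/path/to/dolphin-2.9.2-Phi-3-Medium \
  --tasks leaderboard_ifeval \
  --batch_size 1 \
  --output_path ./ifeval_bs1 \
  --log_samples
```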
But the problem is not an inconsistency between my run and Hugging Face's. It is an inconsistency between Hugging Face's raw results and the percentages shown on Hugging Face's webpage. The raw results I showed are provided by Hugging Face and available here (you might need to request access to this dataset to see the results).
It would also help if the average were returned, so that no further calculation is needed to compare against the leaderboard. For example, Llama reports "the average across all the scores."
Similarly, for the 8B model: benchmark on mmlu_pro is 0.3068 (my results are obtained with vllm==0.6.3.post1 and lm_eval==0.4.7), using:

```bash
HF_TOKEN=xxxxx OPENAI_API_KEY=xxxxx lm_eval --model local-completions --tasks leaderboard \
  --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://locally_hosted_vllm/v1/completions,temperature=0 \
  --apply_chat_template --output_path "./result" --log_samples
```
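(For completeness: the base_url above points at a locally hosted vLLM OpenAI-compatible server. A minimal way to start one is sketched below; the port is an assumption, not necessarily what was used.)

```bash
# Serve the model with vLLM's OpenAI-compatible API; the completions endpoint
# is then reachable at http://localhost:8000/v1/completions.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```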
Hi,
I am trying to reproduce the results on the OpenLLM Leaderboard of Hugging Face; however, I found some inconsistency between the results generated by the harness and the results shown on the leaderboard. For example, for meta-llama/Meta-Llama-3.1-70B-Instruct, its average bbh acc_norm is 0.6915466064919285 (see the first attached figure) in the raw results generated by the harness, but on the leaderboard it is shown as 55.93%. I am wondering how the 55.93% is calculated from 0.6915466064919285, thanks! Similarly, for mmlu-pro, the raw result is 0.5309175531914894 but the leaderboard shows 47.88%.
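For what it's worth, the gap looks like the leaderboard's score normalization rather than an evaluation difference: as I understand the Open LLM Leaderboard v2 docs, raw scores are rescaled between the random-guess baseline and 1 before being displayed. A quick check under that assumption (the formula here is my reading of the docs, not something confirmed in this thread):

```bash
# Assumed normalization: norm = (raw - baseline) / (1 - baseline) * 100.
# MMLU-Pro is 10-way multiple choice, so the random baseline is 0.1.
python3 -c 'raw = 0.5309175531914894; base = 0.1; print(f"{(raw - base) / (1 - base) * 100:.2f}")'
# prints 47.88, which matches the leaderboard value quoted above.
```

For BBH, I believe each subtask is normalized against its own baseline (the number of answer choices varies across subtasks) and the normalized scores are then averaged, which is why applying a single baseline to the 0.6915 average does not reproduce 55.93 directly.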