Description
I've observed a substantial difference between the officially reported ARC-Challenge accuracy and my local evaluation results using lm-evaluation-harness.
Official Results
Reported 25-shot arc_challenge accuracy on the Hugging Face repo (meta-llama/Llama-3.2-3B): 69.1%
Local Evaluation Results
Base Model Test
My command line:
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B,trust_remote_code=True \
    --tasks arc_challenge \
    --batch_size 8 \
    --num_fewshot 25
Results:
Instruct Model
My hypothesis is that the Llama model requires a chat template, so I also tested meta-llama/Llama-3.2-3B-Instruct with the official chat template:
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,trust_remote_code=True,dtype="bfloat16" \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size 8
However, this produces similar results:
| Tasks         | Version | Filter | n-shot | Metric   | Value  | Stderr |
|---------------|---------|--------|--------|----------|--------|--------|
| arc_challenge | 1       | none   | 25     | acc      | 0.4983 | 0.0146 |
| arc_challenge | 1       | none   | 25     | acc_norm | 0.5341 | 0.0146 |
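For context on the two metrics in the table: acc picks the choice with the highest raw loglikelihood, while acc_norm first divides each choice's loglikelihood by its byte length before taking the argmax. A minimal sketch with made-up numbers (not the harness's actual code):

```python
# Minimal sketch of acc vs. acc_norm selection for one multiple-choice item.
# Not the harness's actual code; the loglikelihood values are made up.

def pick(loglikelihoods, choices):
    # acc: argmax of the raw summed loglikelihood of each choice
    acc_idx = max(range(len(choices)), key=lambda i: loglikelihoods[i])
    # acc_norm: argmax after normalizing by the choice's byte length
    norm_idx = max(
        range(len(choices)),
        key=lambda i: loglikelihoods[i] / len(choices[i].encode("utf-8")),
    )
    return acc_idx, norm_idx

choices = ["the sun", "a distant galaxy", "the moon", "Mars"]
lls = [-9.1, -12.4, -8.7, -10.2]
print(pick(lls, choices))  # -> (2, 1): the two metrics can disagree
```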
I'm wondering what caused this inconsistency.
Hi! They use the MMLU format for this task (<question> <A. answer_choice, ...> <Answer: > <"A">), where lettered options are shown and the model is scored on the answer letter, rather than the standard cloze format (<question> <Answer: > <"answer_choice">), where each full answer choice is scored as the continuation.
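To make the difference concrete, here is a sketch of the two prompt formats for a single ARC item (illustrative only; the exact strings used in Meta's evals and in the harness task definitions differ in wording and whitespace details):

```python
# Illustrative sketch of the two ARC prompt formats (assumed wording;
# not copied from Meta's eval code or the harness task definitions).

question = "Which property of a mineral can be determined just by looking at it?"
choices = ["luster", "mass", "weight", "hardness"]
letters = ["A", "B", "C", "D"]

# MMLU-style (what the reported 69.1% uses): lettered options are shown,
# and the model is scored on the loglikelihood of the answer letter.
mmlu_prompt = (
    question + "\n"
    + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    + "\nAnswer:"
)
mmlu_targets = [f" {l}" for l in letters]  # compare " A", " B", " C", " D"

# Cloze format (the harness's default arc_challenge): no options are shown,
# and each full answer string is scored as the continuation.
cloze_prompt = "Question: " + question + "\nAnswer:"
cloze_targets = [f" {c}" for c in choices]  # compare " luster", " mass", ...
```

Scoring a single letter with the options in view is a different (and easier) task than ranking full answer strings without seeing the options, which is why the cloze-format numbers above are not directly comparable to the reported 69.1%.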