
Add support for generative answering of multiple_choice tasks #2601

Open. pasky wants to merge 4 commits into main.
Conversation

@pasky commented Dec 29, 2024

As they stand, multiple_choice tasks cannot be evaluated with popular API models that support only generate_until requests, not logprob.

This PR introduces a flag that allows these tasks to still be evaluated via an emulation mode of sorts that simply asks the model to generate the answer. The abcd approach and most of the prompt are borrowed from openai/evals.
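For illustration, the emulation could be sketched roughly like this (the helper names and exact prompt wording here are hypothetical, not the PR's actual code): render the choices as lettered options, instruct the model to include `ANSWER: <letter>` in its response, and parse the letter back out of the generated text.

```python
import re
import string

def render_mc_prompt(question, choices):
    """Render a multiple_choice doc as a lettered generative prompt."""
    letters = string.ascii_uppercase[:len(choices)]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    instruction = ('Please include "ANSWER: <letter>" in your response '
                   "with the letter of the correct answer.")
    return f"{instruction}\n\n{question}\n{options}", letters

def parse_mc_answer(response, letters):
    """Extract the answered choice index; None if the model did not comply."""
    m = re.search(r"ANSWER:\s*([A-Z])", response)
    if m and m.group(1) in letters:
        return letters.index(m.group(1))
    return None

prompt, letters = render_mc_prompt(
    "What is the boiling point of water at sea level?",
    ["90 C", "100 C", "110 C", "120 C"],
)
assert parse_mc_answer("Water boils at 100 C. ANSWER: B", letters) == 1
```

The parsed index can then be scored against the gold choice exactly as in the logprob path; responses that never emit the `ANSWER:` marker come back as `None` and count as incorrect.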

@pasky commented Dec 30, 2024

For reference, some tinyBenchmarks results with Claude:

```
lm_eval --tasks tinyArc --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```
local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------|------:|------|-----:|--------|---|-----:|---|------|
|tinyArc|      0|none  |    25|acc_norm|↑  |0.8819|±  |   N/A|

```
lm_eval --tasks tinyHellaswag --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```
local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|tinyHellaswag|      0|none  |    10|acc_norm|↑  |0.8283|±  |   N/A|

```
lm_eval --tasks tinyMMLU --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```
local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks  |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|--------|------:|------|-----:|--------|---|----:|---|------|
|tinyMMLU|      0|none  |     0|acc_norm|↑  |0.789|±  |   N/A|

Comment on lines +444 to +445
```python
doc_system_instruction += " "
if multiple_choice_generate == "abcd":
```
Contributor commented:

May I suggest not hardcoding these? What if doc_system_instruction is supposed to be delimited with some other delimiter? What if the set of choices is not 4 letters, not these 4 letters, or not letters at all? This framework supports external tasks and already has multiple forks, so there may well be multiple-choice tasks set up differently than "abcd".
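One way this concern could be addressed, sketched here with hypothetical names (this is not code from the PR): make the label set, the instruction template, and the delimiter task-configurable instead of hardcoding "ABCD" and an English sentence.

```python
from dataclasses import dataclass, field

@dataclass
class MCGenerateConfig:
    # Label set for the choices; tasks whose options are not four
    # Latin letters can override this.
    labels: list = field(default_factory=lambda: ["A", "B", "C", "D"])
    # Instruction template; non-English tasks can supply a translation.
    instruction: str = ('Please include "ANSWER: <label>" in your response '
                       "with the label of the correct answer.")
    # Delimiter between the system instruction and rendered options.
    delimiter: str = "\n"

def render_options(choices, cfg):
    """Pair each choice with its configured label."""
    if len(choices) > len(cfg.labels):
        raise ValueError("not enough labels for this task's choices")
    return cfg.delimiter.join(
        f"{label}. {choice}" for label, choice in zip(cfg.labels, choices))

# A task with numeric labels and a custom delimiter:
cfg = MCGenerateConfig(labels=["1", "2", "3"], delimiter=" | ")
print(render_options(["yes", "no", "maybe"], cfg))
# → 1. yes | 2. no | 3. maybe
```

Such a config could live per-task in the YAML, with the current "abcd" behavior as the default, so existing tasks and external forks keep working unchanged.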

Comment on lines +446 to +448
```python
    doc_system_instruction += "Please include \"ANSWER: <letter>\" in your response with the letter of the correct last answer."
else:
    doc_system_instruction += "Please answer with the letter of the correct last answer."
```
Contributor commented:

What about non-english tasks that are already inside this repo?
