
Add support for generative answering of multiple_choice tasks #2601

Open. pasky wants to merge 4 commits into main.
Conversation

@pasky commented Dec 29, 2024

As they stand, multiple_choice tasks cannot be evaluated with popular API models that support only generate_until requests, not logprob.

This PR introduces a flag that allows these tasks to still be evaluated via an emulation mode of sorts that simply asks the model to generate the answer. The abcd approach and most of the prompt are borrowed from openai/evals.
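For illustration, the emulation could be sketched roughly like this (the helper names and exact prompt wording here are hypothetical, not the PR's actual code): render the choices as lettered options, instruct the model to include `ANSWER: <letter>` in its response, and parse the letter back out of the generated text.

```python
import re
import string

def render_mc_prompt(question, choices):
    """Render a multiple_choice doc as a lettered generative prompt."""
    letters = string.ascii_uppercase[:len(choices)]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    instruction = ('Please include "ANSWER: <letter>" in your response '
                   "with the letter of the correct answer.")
    return f"{instruction}\n\n{question}\n{options}", letters

def parse_mc_answer(response, letters):
    """Extract the answered choice index; None if the model did not comply."""
    m = re.search(r"ANSWER:\s*([A-Z])", response)
    if m and m.group(1) in letters:
        return letters.index(m.group(1))
    return None

prompt, letters = render_mc_prompt(
    "What is the boiling point of water at sea level?",
    ["90 C", "100 C", "110 C", "120 C"],
)
assert parse_mc_answer("Water boils at 100 C. ANSWER: B", letters) == 1
```

The parsed index can then be scored against the gold choice exactly as in the logprob path; responses that never emit the `ANSWER:` marker come back as `None` and count as incorrect.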

@pasky commented Dec 30, 2024

For reference, some tinyBenchmarks results with Claude:

```
lm_eval --tasks tinyArc --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```
local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------|------:|------|-----:|--------|---|-----:|---|------|
|tinyArc|      0|none  |    25|acc_norm|↑  |0.8819|±  |   N/A|

```
lm_eval --tasks tinyHellaswag --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```
local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|tinyHellaswag|      0|none  |    10|acc_norm|↑  |0.8283|±  |   N/A|

```
lm_eval --tasks tinyMMLU --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```
local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks  |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|--------|------:|------|-----:|--------|---|----:|---|------|
|tinyMMLU|      0|none  |     0|acc_norm|↑  |0.789|±  |   N/A|

Comment on lines +444 to +445
```python
doc_system_instruction += " "
if multiple_choice_generate == "abcd":
```
Contributor commented:

May I suggest not hardcoding these? What if doc_system_instruction is supposed to be delimited with some other delimiter? What if the set of choices is not 4 letters, not these 4 letters, or not letters at all? This framework supports external tasks and already has multiple forks, so there may well be multiple-choice tasks set up differently than "abcd".
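One way this concern could be addressed, sketched here with hypothetical names (this is not code from the PR): make the label set, the instruction template, and the delimiter task-configurable instead of hardcoding "ABCD" and an English sentence.

```python
from dataclasses import dataclass, field

@dataclass
class MCGenerateConfig:
    # Label set for the choices; tasks whose options are not four
    # Latin letters can override this.
    labels: list = field(default_factory=lambda: ["A", "B", "C", "D"])
    # Instruction template; non-English tasks can supply a translation.
    instruction: str = ('Please include "ANSWER: <label>" in your response '
                       "with the label of the correct answer.")
    # Delimiter between the system instruction and rendered options.
    delimiter: str = "\n"

def render_options(choices, cfg):
    """Pair each choice with its configured label."""
    if len(choices) > len(cfg.labels):
        raise ValueError("not enough labels for this task's choices")
    return cfg.delimiter.join(
        f"{label}. {choice}" for label, choice in zip(cfg.labels, choices))

# A task with numeric labels and a custom delimiter:
cfg = MCGenerateConfig(labels=["1", "2", "3"], delimiter=" | ")
print(render_options(["yes", "no", "maybe"], cfg))
# → 1. yes | 2. no | 3. maybe
```

Such a config could live per-task in the YAML, with the current "abcd" behavior as the default, so existing tasks and external forks keep working unchanged.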

Comment on lines +446 to +448
```python
    doc_system_instruction += "Please include \"ANSWER: <letter>\" in your response with the letter of the correct last answer."
else:
    doc_system_instruction += "Please answer with the letter of the correct last answer."
```
Contributor commented:

What about non-english tasks that are already inside this repo?
