# v0.4.1 Release Notes
This release contains all changes made since v0.4.0, and also serves as a partial test of our release automation, provided by @anjor.
At a high level, some of the changes include:
- Data-parallel inference using vLLM (contributed by @baberabb)
- A major fix to Hugging Face model generation: in v0.4.0, a bug in stop-sequence handling sometimes cut generations off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to @veekaybee, @mgoin, @anjor, and others for this)!
- Integration with tools for visualization of results, such as Zeno, with WandB coming soon!
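As a rough sketch of how the new OpenAI-compatible support can be used (the model name, URL, and task below are placeholders, and exact flag spellings may differ in your install), `local-completions` lets the harness target any server exposing an OpenAI-style completions endpoint:

```shell
# Hypothetical example: evaluate a model served behind an
# OpenAI-compatible /v1 completions API on localhost.
lm_eval --model local-completions \
    --model_args model=facebook/opt-125m,base_url=http://localhost:8000/v1 \
    --tasks hellaswag \
    --batch_size 8
```

`local-chat-completions` works analogously for chat-style endpoints.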
More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!
We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.
In the next version release, we hope to include
- Chat Templating + System Prompt support, for locally-run models
- Improved Answer Extraction for many generative tasks, making them more easily run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times and faster non-inference processing steps, especially when `num_fewshot` is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, making it easier to register many tasks and configure new groups of tasks
## What's Changed
- Announce v0.4.0 in README by @haileyschoelkopf in #1061
- remove commented planned samplers in `lm_eval/api/samplers.py` by @haileyschoelkopf in #1062
- Confirming links in docs work (WIP) by @haileyschoelkopf in #1065
- Set actual version to v0.4.0 by @haileyschoelkopf in #1064
- Updating docs hyperlinks by @haileyschoelkopf in #1066
- Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in #1063
- Update _cot_fewshot_template_yaml by @lintangsutawika in #1074
- Patch scrolls by @lintangsutawika in #1077
- Update template of qqp dataset by @shiweijiezero in #1097
- Change the sub-task name from sst to sst2 in glue by @shiweijiezero in #1099
- Add kmmlu evaluation to tasks by @h-albert-lee in #1089
- Fix stderr by @lintangsutawika in #1106
- Simplified `evaluator.py` by @lintangsutawika in #1104
- [Refactor] vllm data parallel by @baberabb in #1035
- Unpack group in `write_out` by @baberabb in #1113
- Revert "Simplified `evaluator.py`" by @lintangsutawika in #1116
- `qqp`, `mnli_mismatch`: remove unlabeled test sets by @baberabb in #1114
- fix: bug of BBH_cot_fewshot by @Momo-Tori in #1118
- Bump BBH version by @haileyschoelkopf in #1120
- Refactor `hf` modeling code by @haileyschoelkopf in #1096
- Additional process for `doc_to_choice` by @lintangsutawika in #1093
- doc_to_decontamination_query can use function by @lintangsutawika in #1082
- Fix vllm `batch_size` type by @xTayEx in #1128
- fix: passing max_length to vllm engine args by @NanoCode012 in #1124
- Fix Loading Local Dataset by @lintangsutawika in #1127
- place model onto `mps` by @baberabb in #1133
- Add benchmark FLD by @MorishT in #1122
- fix typo in README.md by @lennijusten in #1136
- add correct openai api key to README.md by @lennijusten in #1138
- Update Linter CI Job by @haileyschoelkopf in #1130
- add utils.clear_torch_cache() to model_comparator by @baberabb in #1142
- Enabling OpenAI completions via gooseai by @veekaybee in #1141
- vllm clean up tqdm by @baberabb in #1144
- openai nits by @baberabb in #1139
- Add IFEval / Instruction-Following Eval by @wiskojo in #1087
- set `--gen_kwargs` arg to None by @baberabb in #1145
- Add shorthand flags by @baberabb in #1149
- fld bugfix by @baberabb in #1150
- Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in #1154
- Add docs on adding a multiple choice metric by @polm-stability in #1147
- Simplify evaluator by @lintangsutawika in #1126
- Generalize Qwen tokenizer fix by @haileyschoelkopf in #1146
- self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in #1172
- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in #1171
- feat: add option to upload results to Zeno by @Sparkier in #990
- Switch Linting to `ruff` by @baberabb in #1166
- Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in #1178
- Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in #1174
- Update README.md by @anjor in #1184
- Update README.md by @anjor in #1183
- Add tokenizer backend by @anjor in #1186
- Correctly Print Task Versioning by @haileyschoelkopf in #1173
- update Zeno example and reference in README by @Sparkier in #1190
- Remove tokenizer for openai chat completions by @anjor in #1191
- Update README.md by @anjor in #1181
- disable `mypy` by @baberabb in #1193
- Generic decorator for handling rate limit errors by @zachschillaci27 in #1109
- Refer in README to main branch by @BramVanroy in #1200
- Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in #1189
- Upstream Mamba Support (`mamba_ssm`) by @haileyschoelkopf in #1110
- Update cuda handling by @anjor in #1180
- Fix documentation in API table by @haileyschoelkopf in #1203
- Consolidate batching by @baberabb in #1197
- Add remove_whitespace to FLD benchmark by @MorishT in #1206
- Fix the argument order in `utils.divide` doc by @xTayEx in #1208
- [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in #1212
- fix unbounded local variable by @onnoo in #1218
- nits + fix siqa by @baberabb in #1216
- add length of strings and answer options to Zeno metadata by @Sparkier in #1222
- Don't silence errors when loading tasks by @polm-stability in #1148
- Update README.md by @anjor in #1195
- Update race's README.md by @pminervini in #1230
- batch_scheduler bug in Collator by @baberabb in #1229
- Update openai_completions.py by @StellaAthena in #1238
- vllm: handle max_length better and substitute Collator by @baberabb in #1241
- Remove self.dataset_path post_init process by @lintangsutawika in #1243
- Add multilingual HellaSwag task by @JorgeDeCorte in #1228
- Do not escape ascii in logging outputs by @passaglia in #1246
- fixed fewshot loading for multiple input tasks by @lintangsutawika in #1255
- Revert citation by @StellaAthena in #1257
- Specify utf-8 encoding to properly save non-ascii samples to file by @baberabb in #1265
- Fix evaluation for the belebele dataset by @jmichaelov in #1267
- Call "exact_match" once for each multiple-target sample by @baberabb in #1266
- MultiMedQA by @tmabraham in #1198
- Fix bug in multi-token Stop Sequences by @haileyschoelkopf in #1268
- Update Table Printing by @haileyschoelkopf in #1271
- add Kobest by @jp1924 in #1263
- Apply `process_docs()` to fewshot_split by @haileyschoelkopf in #1276
- Fix whitespace issues in GSM8k-CoT by @haileyschoelkopf in #1275
- Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs by @haileyschoelkopf in #1261
- Allow parameter edits for registered tasks when listed in a benchmark by @lintangsutawika in #1273
- Fix data-parallel evaluation with quantized models by @haileyschoelkopf in #1270
- Rework documentation for explaining local dataset by @lintangsutawika in #1284
- Update CITATION.bib by @haileyschoelkopf in #1285
- Update `nq_open` / NaturalQs whitespacing by @haileyschoelkopf in #1289
- Update README.md with custom integration doc by @msaroufim in #1298
- Update nq_open.yaml by @Hannibal046 in #1305
- Update task_guide.md by @daniellepintz in #1306
- Pin `datasets` dependency at 2.15 by @haileyschoelkopf in #1312
- Fix polemo2_in.yaml subset name by @lhoestq in #1313
- Fix `datasets` dependency to >=2.14 by @haileyschoelkopf in #1314
- Fix group register by @lintangsutawika in #1315
- Update task_guide.md by @djstrong in #1316
- Update polemo2_in.yaml by @lintangsutawika in #1318
- Fix: Mamba receives extra kwargs by @haileyschoelkopf in #1328
- Fix Issue regarding stderr by @lintangsutawika in #1327
- Add `local-completions` support using OpenAI interface by @mgoin in #1277
- fallback to classname when LM doesn't have config by @nairbv in #1334
- fix a trailing whitespace that breaks a lint job by @nairbv in #1335
- skip "benchmarks" in changed_tasks by @baberabb in #1336
- Update migrated HF dataset paths by @haileyschoelkopf in #1332
- Don't use `get_task_dict()` in task registration / initialization by @haileyschoelkopf in #1331
- manage default (greedy) gen_kwargs in vllm by @baberabb in #1341
- vllm: change default gen_kwargs behaviour; prompt_logprobs=1 by @baberabb in #1345
- Update links to advanced_task_guide.md by @haileyschoelkopf in #1348
- `Filter` docs not offset by `doc_id` by @baberabb in #1349
- Add FAQ on `lm_eval.tasks.initialize_tasks()` to README by @haileyschoelkopf in #1330
- Refix issue regarding stderr by @thnkinbtfly in #1357
- Add causalLM OpenVino models by @NoushNabi in #1290
- Apply some best practices and guideline recommendations to code by @LSinev in #1363
- serialize callable functions in config by @baberabb in #1367
- delay filter init; remove `*args` by @baberabb in #1369
- Fix unintuitive `--gen_kwargs` behavior by @haileyschoelkopf in #1329
- Publish to pypi by @anjor in #1194
- Make dependencies compatible with PyPI by @haileyschoelkopf in #1378
## New Contributors
- @shiweijiezero made their first contribution in #1097
- @h-albert-lee made their first contribution in #1089
- @Momo-Tori made their first contribution in #1118
- @xTayEx made their first contribution in #1128
- @NanoCode012 made their first contribution in #1124
- @MorishT made their first contribution in #1122
- @lennijusten made their first contribution in #1136
- @veekaybee made their first contribution in #1141
- @wiskojo made their first contribution in #1087
- @polm-stability made their first contribution in #1147
- @seungduk-yanolja made their first contribution in #1171
- @Sparkier made their first contribution in #990
- @anjor made their first contribution in #1184
- @zachschillaci27 made their first contribution in #1109
- @BramVanroy made their first contribution in #1200
- @onnoo made their first contribution in #1218
- @JorgeDeCorte made their first contribution in #1228
- @jmichaelov made their first contribution in #1267
- @jp1924 made their first contribution in #1263
- @msaroufim made their first contribution in #1298
- @Hannibal046 made their first contribution in #1305
- @daniellepintz made their first contribution in #1306
- @lhoestq made their first contribution in #1313
- @djstrong made their first contribution in #1316
- @nairbv made their first contribution in #1334
- @thnkinbtfly made their first contribution in #1357
- @NoushNabi made their first contribution in #1290
- @LSinev made their first contribution in #1363
**Full Changelog**: v0.4.0...v0.4.1