# v0.4.1 Release Notes
This release contains all changes made since v0.4.0, and also serves as a partial test of our release automation, provided by @anjor.
At a high level, some of the changes include:
- Data-parallel inference using vLLM (contributed by @baberabb)
- A major fix to Hugging Face model generation: in v0.4.0, a bug in stop-sequence handling sometimes cut generations off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to @veekaybee, @mgoin, @anjor, and others for this)!
- Integration with tools for visualization of results, such as Zeno, with WandB coming soon!
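As a rough sketch of how the new OpenAI-compatible support can be used (the model name, URL, and task below are placeholders, and exact flag spellings may differ in your install), `local-completions` lets the harness target any server exposing an OpenAI-style completions endpoint:

```shell
# Hypothetical example: evaluate a model served behind an
# OpenAI-compatible /v1 completions API on localhost.
lm_eval --model local-completions \
    --model_args model=facebook/opt-125m,base_url=http://localhost:8000/v1 \
    --tasks hellaswag \
    --batch_size 8
```

`local-chat-completions` works analogously for chat-style endpoints.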
More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!
We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.
In the next version release, we hope to include
- Chat Templating + System Prompt support, for locally-run models
- Improved Answer Extraction for many generative tasks, making them more easily run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times and faster non-inference processing steps, especially when `num_fewshot` is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, making it easier to register many tasks and configure new groups of tasks
## What's Changed
- Announce v0.4.0 in README by @haileyschoelkopf in #1061
- remove commented planned samplers in `lm_eval/api/samplers.py` by @haileyschoelkopf in #1062
- Confirming links in docs work (WIP) by @haileyschoelkopf in #1065
- Set actual version to v0.4.0 by @haileyschoelkopf in #1064
- Updating docs hyperlinks by @haileyschoelkopf in #1066
- Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in #1063
- Update _cot_fewshot_template_yaml by @lintangsutawika in #1074
- Patch scrolls by @lintangsutawika in #1077
- Update template of qqp dataset by @shiweijiezero in #1097
- Change the sub-task name from sst to sst2 in glue by @shiweijiezero in #1099
- Add kmmlu evaluation to tasks by @h-albert-lee in #1089
- Fix stderr by @lintangsutawika in #1106
- Simplified `evaluator.py` by @lintangsutawika in #1104
- [Refactor] vllm data parallel by @baberabb in #1035
- Unpack group in `write_out` by @baberabb in #1113
- Revert "Simplified `evaluator.py`" by @lintangsutawika in #1116
- `qqp`, `mnli_mismatch`: remove unlabeled test sets by @baberabb in #1114
- fix: bug of BBH_cot_fewshot by @Momo-Tori in #1118
- Bump BBH version by @haileyschoelkopf in #1120
- Refactor `hf` modeling code by @haileyschoelkopf in #1096
- Additional process for `doc_to_choice` by @lintangsutawika in #1093
- doc_to_decontamination_query can use function by @lintangsutawika in #1082
- Fix vllm `batch_size` type by @xTayEx in #1128
- fix: passing max_length to vllm engine args by @NanoCode012 in #1124
- Fix Loading Local Dataset by @lintangsutawika in #1127
- place model onto `mps` by @baberabb in #1133
- Add benchmark FLD by @MorishT in #1122
- fix typo in README.md by @lennijusten in #1136
- add correct openai api key to README.md by @lennijusten in #1138
- Update Linter CI Job by @haileyschoelkopf in #1130
- add utils.clear_torch_cache() to model_comparator by @baberabb in #1142
- Enabling OpenAI completions via gooseai by @veekaybee in #1141
- vllm clean up tqdm by @baberabb in #1144
- openai nits by @baberabb in #1139
- Add IFEval / Instruction-Following Eval by @wiskojo in #1087
- set `--gen_kwargs` arg to None by @baberabb in #1145
- Add shorthand flags by @baberabb in #1149
- fld bugfix by @baberabb in #1150
- Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in #1154
- Add docs on adding a multiple choice metric by @polm-stability in #1147
- Simplify evaluator by @lintangsutawika in #1126
- Generalize Qwen tokenizer fix by @haileyschoelkopf in #1146
- self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in #1172
- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in #1171
- feat: add option to upload results to Zeno by @Sparkier in #990
- Switch Linting to `ruff` by @baberabb in #1166
- Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in #1178
- Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in #1174
- Update README.md by @anjor in #1184
- Update README.md by @anjor in #1183
- Add tokenizer backend by @anjor in #1186
- Correctly Print Task Versioning by @haileyschoelkopf in #1173
- update Zeno example and reference in README by @Sparkier in #1190
- Remove tokenizer for openai chat completions by @anjor in #1191
- Update README.md by @anjor in #1181
- disable `mypy` by @baberabb in #1193
- Generic decorator for handling rate limit errors by @zachschillaci27 in #1109
- Refer in README to main branch by @BramVanroy in #1200
- Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in #1189
- Upstream Mamba Support (`mamba_ssm`) by @haileyschoelkopf in #1110
- Update cuda handling by @anjor in #1180
- Fix documentation in API table by @haileyschoelkopf in #1203
- Consolidate batching by @baberabb in #1197
- Add remove_whitespace to FLD benchmark by @MorishT in #1206
- Fix the argument order in `utils.divide` doc by @xTayEx in #1208
- [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in #1212
- fix unbounded local variable by @onnoo in #1218
- nits + fix siqa by @baberabb in #1216
- add length of strings and answer options to Zeno metadata by @Sparkier in #1222
- Don't silence errors when loading tasks by @polm-stability in #1148
- Update README.md by @anjor in #1195
- Update race's README.md by @pminervini in #1230
- batch_scheduler bug in Collator by @baberabb in #1229
- Update openai_completions.py by @StellaAthena in #1238
- vllm: handle max_length better and substitute Collator by @baberabb in #1241
- Remove self.dataset_path post_init process by @lintangsutawika in #1243
- Add multilingual HellaSwag task by @JorgeDeCorte in #1228
- Do not escape ascii in logging outputs by @passaglia in #1246
- fixed fewshot loading for multiple input tasks by @lintangsutawika in #1255
- Revert citation by @StellaAthena in #1257
- Specify utf-8 encoding to properly save non-ascii samples to file by @baberabb in #1265
- Fix evaluation for the belebele dataset by @jmichaelov in #1267
- Call "exact_match" once for each multiple-target sample by @baberabb in #1266
- MultiMedQA by @tmabraham in #1198
- Fix bug in multi-token Stop Sequences by @haileyschoelkopf in #1268
- Update Table Printing by @haileyschoelkopf in #1271
- add Kobest by @jp1924 in #1263
- Apply `process_docs()` to fewshot_split by @haileyschoelkopf in #1276
- Fix whitespace issues in GSM8k-CoT by @haileyschoelkopf in #1275
- Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs by @haileyschoelkopf in #1261
- Allow parameter edits for registered tasks when listed in a benchmark by @lintangsutawika in #1273
- Fix data-parallel evaluation with quantized models by @haileyschoelkopf in #1270
- Rework documentation for explaining local dataset by @lintangsutawika in #1284
- Update CITATION.bib by @haileyschoelkopf in #1285
- Update `nq_open` / NaturalQs whitespacing by @haileyschoelkopf in #1289
- Update README.md with custom integration doc by @msaroufim in #1298
- Update nq_open.yaml by @Hannibal046 in #1305
- Update task_guide.md by @daniellepintz in #1306
- Pin `datasets` dependency at 2.15 by @haileyschoelkopf in #1312
- Fix polemo2_in.yaml subset name by @lhoestq in #1313
- Fix `datasets` dependency to >=2.14 by @haileyschoelkopf in #1314
- Fix group register by @lintangsutawika in #1315
- Update task_guide.md by @djstrong in #1316
- Update polemo2_in.yaml by @lintangsutawika in #1318
- Fix: Mamba receives extra kwargs by @haileyschoelkopf in #1328
- Fix Issue regarding stderr by @lintangsutawika in #1327
- Add `local-completions` support using OpenAI interface by @mgoin in #1277
- fallback to classname when LM doesn't have config by @nairbv in #1334
- fix a trailing whitespace that breaks a lint job by @nairbv in #1335
- skip "benchmarks" in changed_tasks by @baberabb in #1336
- Update migrated HF dataset paths by @haileyschoelkopf in #1332
- Don't use `get_task_dict()` in task registration / initialization by @haileyschoelkopf in #1331
- manage default (greedy) gen_kwargs in vllm by @baberabb in #1341
- vllm: change default gen_kwargs behaviour; prompt_logprobs=1 by @baberabb in #1345
- Update links to advanced_task_guide.md by @haileyschoelkopf in #1348
- `Filter` docs not offset by `doc_id` by @baberabb in #1349
- Add FAQ on `lm_eval.tasks.initialize_tasks()` to README by @haileyschoelkopf in #1330
- Refix issue regarding stderr by @thnkinbtfly in #1357
- Add causalLM OpenVino models by @NoushNabi in #1290
- Apply some best practices and guideline recommendations to code by @LSinev in #1363
- serialize callable functions in config by @baberabb in #1367
- delay filter init; remove `*args` by @baberabb in #1369
- Fix unintuitive `--gen_kwargs` behavior by @haileyschoelkopf in #1329
- Publish to pypi by @anjor in #1194
- Make dependencies compatible with PyPI by @haileyschoelkopf in #1378
## New Contributors
- @shiweijiezero made their first contribution in #1097
- @h-albert-lee made their first contribution in #1089
- @Momo-Tori made their first contribution in #1118
- @xTayEx made their first contribution in #1128
- @NanoCode012 made their first contribution in #1124
- @MorishT made their first contribution in #1122
- @lennijusten made their first contribution in #1136
- @veekaybee made their first contribution in #1141
- @wiskojo made their first contribution in #1087
- @polm-stability made their first contribution in #1147
- @seungduk-yanolja made their first contribution in #1171
- @Sparkier made their first contribution in #990
- @anjor made their first contribution in #1184
- @zachschillaci27 made their first contribution in #1109
- @BramVanroy made their first contribution in #1200
- @onnoo made their first contribution in #1218
- @JorgeDeCorte made their first contribution in #1228
- @jmichaelov made their first contribution in #1267
- @jp1924 made their first contribution in #1263
- @msaroufim made their first contribution in #1298
- @Hannibal046 made their first contribution in #1305
- @daniellepintz made their first contribution in #1306
- @lhoestq made their first contribution in #1313
- @djstrong made their first contribution in #1316
- @nairbv made their first contribution in #1334
- @thnkinbtfly made their first contribution in #1357
- @NoushNabi made their first contribution in #1290
- @LSinev made their first contribution in #1363
**Full Changelog**: v0.4.0...v0.4.1