# lm-eval v0.4.2 Release Notes
We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness: as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), to enable controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!
## New Additions
- Request Caching by @inf3rnus - speedups on startup via caching the construction of documents/requests’ contexts
- Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!
- New Tasks
  - KMMLU, a localized (not auto-translated!) dataset for testing Korean knowledge, by @h-albert-lee and @guijinSON
  - GPQA by @uanu2002
  - French Bench by @ManuelFay
  - EQ-Bench by @pbevan1 and @sqrkl
  - HAERAE-Bench, re-added by @h-albert-lee
  - Updates to answer parsing on many generative tasks (GSM8k, MGSM, BBH zeroshot) by @thnkinbtfly!
  - Okapi (translated) Open LLM Leaderboard tasks by @uanu2002 and @giux78
  - Arabic MMLU and aEXAMS by @khalil-Hennara
  - And more!
- Re-introduction of the `TemplateLM` base class for lower-code new LM class implementations by @anjor
- Run the library with the metrics/scoring stage skipped via `--predict_only` by @baberabb
- Many more miscellaneous improvements by a lot of great contributors!
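The request-caching addition speeds up startup by reusing already-built document/request contexts instead of reconstructing them. As a rough in-memory analogy of that idea (the actual feature caches to disk, and all names below are illustrative, not the library's API):

```python
from functools import lru_cache

# Illustrative only: memoize an expensive context-construction step so
# that repeated lookups reuse the already-built string.
@lru_cache(maxsize=None)
def build_context(doc_text: str, num_fewshot: int) -> str:
    # Stand-in for the expensive work of assembling a document's prompt
    # (few-shot examples followed by the document itself).
    fewshot_block = "\n".join(f"Example {i}" for i in range(num_fewshot))
    return f"{fewshot_block}\n{doc_text}"

first = build_context("Q: What is 2+2?", 2)
second = build_context("Q: What is 2+2?", 2)  # served from the cache, not rebuilt
```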
## Backwards Incompatibilities
There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:
### `TaskManager` API

Previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs.
Old usage:

```python
import lm_eval

lm_eval.tasks.initialize_tasks()
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])
```
New intended usage:

```python
import lm_eval
from lm_eval.tasks import TaskManager

# Optional -- you only need to instantiate a TaskManager yourself
# if you want to pass a custom task path!
task_manager = TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)
```
`get_task_dict()` now also optionally takes a `TaskManager` object, for use when loading custom tasks. This should allow for much faster library startup times, since requested tasks or groups are now loaded lazily.
### Updated Stderr Aggregation

Previous versions of the library reported erroneously large stderr scores for groups of tasks such as MMLU. We've since updated the formula to correctly aggregate standard error scores for groups of tasks whose accuracies are aggregated via their mean across the dataset -- see #1390 and #1427 for more information.
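As a rough illustration of the pooled approach (function and variable names here are illustrative, not the library's actual API): given each subtask's standard error and sample size, recover the per-subtask sample variances, pool them weighted by degrees of freedom, and take the standard error of the mean over the combined sample.

```python
import math

# Sketch of pooled-stderr aggregation for subtasks that each report a
# mean-aggregated accuracy with standard error se_i over n_i documents.
def pooled_stderr(stderrs, sizes):
    # Recover each subtask's sample variance: se_i = s_i / sqrt(n_i)
    # implies s_i^2 = se_i^2 * n_i.
    sample_vars = [se**2 * n for se, n in zip(stderrs, sizes)]
    total_n = sum(sizes)
    # Pool variances, weighting each by its degrees of freedom (n_i - 1).
    pooled_var = sum(
        (n - 1) * v for n, v in zip(sizes, sample_vars)
    ) / (total_n - len(sizes))
    # Standard error of the mean over the combined sample.
    return math.sqrt(pooled_var / total_n)
```

For example, two subtasks each with stderr 0.02 over 100 documents pool to about 0.0141 (0.02 divided by sqrt(2)), smaller than either subtask's individual stderr, as expected when combining samples.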
As always, please feel free to give us feedback or request new features! We're grateful for the community's support.
## What's Changed
- Add support for RWKV models with World tokenizer by @PicoCreator in #1374
- add bypass metric by @baberabb in #1156
- Expand docs, update CITATION.bib by @haileyschoelkopf in #1227
- Hf: minor edge cases by @baberabb in #1380
- Enable override of printed `n-shot` in table by @haileyschoelkopf in #1379
- Faster Task and Group Loading, Allow Recursive Groups by @lintangsutawika in #1321
- Fix for #1383 by @pminervini in #1384
- fix on --task list by @lintangsutawika in #1387
- Support for Inf2 optimum class [WIP] by @michaelfeil in #1364
- Update README.md by @mycoalchen in #1398
- Fix confusing `write_out.py` instructions in README by @haileyschoelkopf in #1371
- Use Pooled rather than Combined Variance for calculating stderr of task groupings by @haileyschoelkopf in #1390
- adding hf_transfer by @michaelfeil in #1400
- `batch_size` with `auto` defaults to 1 if `No executable batch size found` is raised by @pminervini in #1405
- Fix printing bug in #1390 by @haileyschoelkopf in #1414
- Fixes #1416 by @pminervini in #1418
- Fix watchdog timeout by @JeevanBhoot in #1404
- Evaluate by @baberabb in #1385
- Add multilingual ARC task by @uanu2002 in #1419
- Add multilingual TruthfulQA task by @uanu2002 in #1420
- [m_mmul] added multilingual evaluation from alexandrainst/m_mmlu by @giux78 in #1358
- Added seeds to `evaluator.simple_evaluate` signature by @Am1n3e in #1412
- Fix: task weighting by subtask size; update Pooled Stderr formula slightly by @haileyschoelkopf in #1427
- Refactor utilities into a separate model utils file. by @baberabb in #1429
- Nit fix: Updated OpenBookQA Readme by @adavidho in #1430
- improve hf_transfer activation by @michaelfeil in #1438
- Correct typo in task name in ARC documentation by @larekrow in #1443
- update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) by @thnkinbtfly in #1356
- Add a new task HaeRae-Bench by @h-albert-lee in #1445
- Group reqs by context by @baberabb in #1425
- Add a new task GPQA (the part without CoT) by @uanu2002 in #1434
- Added KMMLU evaluation method and changed ReadMe by @h-albert-lee in #1447
- Add TemplateLM boilerplate LM class by @anjor in #1279
- Log which subtasks were called with which groups by @haileyschoelkopf in #1456
- PR fixing the issue #1391 (wrong contexts in the mgsm task) by @leocnj in #1440
- feat: Add Weights and Biases support by @ayulockin in #1339
- Fixed generation args issue affecting OpenAI completion model by @Am1n3e in #1458
- update parsing logic of mgsm following gsm8k (mgsm en 0 -> 50%) by @thnkinbtfly in #1462
- Adding documentation for Weights and Biases CLI interface by @veekaybee in #1466
- Add environment and transformers version logging in results dump by @LSinev in #1464
- Apply code autoformatting with Ruff to tasks/*.py and `__init__.py` by @LSinev in #1469
- Setting trust_remote_code to `True` for HuggingFace datasets compatibility by @veekaybee in #1467
- add arabic mmlu by @khalil-Hennara in #1402
- Add Gemma support (Add flag to control BOS token usage) by @haileyschoelkopf in #1465
- Revert "Setting trust_remote_code to `True` for HuggingFace datasets compatibility" by @haileyschoelkopf in #1474
- Create a means for caching task registration and request building. Ad… by @inf3rnus in #1372
- Cont metrics by @lintangsutawika in #1475
- Refactor `evaluater.evaluate` by @baberabb in #1441
- add multilingual mmlu eval by @jordane95 in #1484
- Update TruthfulQA val split name by @haileyschoelkopf in #1488
- Fix AttributeError in huggingface.py When 'model_type' is Missing by @richwardle in #1489
- Fix duplicated kwargs in some model init by @lchu-ibm in #1495
- Add multilingual truthfulqa targets by @jordane95 in #1499
- Always include EOS token as stop sequence by @haileyschoelkopf in #1480
- Improve data-parallel request partitioning for VLLM by @haileyschoelkopf in #1477
- modify `WandbLogger` to accept arbitrary kwargs by @baberabb in #1491
- Vllm update DP+TP by @baberabb in #1508
- Setting trust_remote_code to True for HuggingFace datasets compatibility by @veekaybee in #1487
- Cleaning up unused unit tests by @veekaybee in #1516
- French Bench by @ManuelFay in #1500
- Hotfix: fix TypeError in `--trust_remote_code` by @haileyschoelkopf in #1517
- Fix minor edge cases (#951 #1503) by @haileyschoelkopf in #1520
- Openllm benchmark by @baberabb in #1526
- Add a new task GPQA (the part CoT and generative) by @uanu2002 in #1482
- Add EQ-Bench as per #1459 by @pbevan1 in #1511
- Add WMDP Multiple-choice by @justinphan3110 in #1534
- Adding new task : KorMedMCQA by @sean0042 in #1530
- Update docs on LM.loglikelihood_rolling abstract method by @haileyschoelkopf in #1532
- Minor KMMLU cleanup by @haileyschoelkopf in #1502
- Cleanup and fixes (Task, Instance, and a little bit of *evaluate) by @LSinev in #1533
- Update installation commands in openai_completions.py and contributing document, and update wandb_args description by @naem1023 in #1536
- Add compatibility for vLLM's new Logprob object by @Yard1 in #1549
- Fix incorrect `max_gen_toks` generation kwarg default in code2_text. by @cosmo3769 in #1551
- Support jinja templating for task descriptions by @HishamYahya in #1553
- Fix incorrect `max_gen_toks` generation kwarg default in generative Bigbench by @haileyschoelkopf in #1546
- Hardcode IFEval to 0-shot by @haileyschoelkopf in #1506
- add Arabic EXAMS benchmark by @khalil-Hennara in #1498
- AGIEval by @haileyschoelkopf in #1359
- cli_evaluate calls simple_evaluate with the same verbosity. by @Wongboo in #1563
- add manual tqdm disabling management by @artemorloff in #1569
- Fix README section on vllm integration by @eitanturok in #1579
- Fix Jinja template for Advanced AI Risk by @RylanSchaeffer in #1587
- Proposed approach for testing CLI arg parsing by @veekaybee in #1566
- Patch for Seq2Seq Model predictions by @lintangsutawika in #1584
- Add start date in results.json by @djstrong in #1592
- Cleanup for v0.4.2 release by @haileyschoelkopf in #1573
- Fix eval_logger import for mmlu/_generate_configs.py by @noufmitla in #1593
## New Contributors
- @PicoCreator made their first contribution in #1374
- @michaelfeil made their first contribution in #1364
- @mycoalchen made their first contribution in #1398
- @JeevanBhoot made their first contribution in #1404
- @uanu2002 made their first contribution in #1419
- @giux78 made their first contribution in #1358
- @Am1n3e made their first contribution in #1412
- @adavidho made their first contribution in #1430
- @larekrow made their first contribution in #1443
- @leocnj made their first contribution in #1440
- @ayulockin made their first contribution in #1339
- @khalil-Hennara made their first contribution in #1402
- @inf3rnus made their first contribution in #1372
- @jordane95 made their first contribution in #1484
- @richwardle made their first contribution in #1489
- @lchu-ibm made their first contribution in #1495
- @pbevan1 made their first contribution in #1511
- @justinphan3110 made their first contribution in #1534
- @sean0042 made their first contribution in #1530
- @naem1023 made their first contribution in #1536
- @Yard1 made their first contribution in #1549
- @cosmo3769 made their first contribution in #1551
- @HishamYahya made their first contribution in #1553
- @Wongboo made their first contribution in #1563
- @artemorloff made their first contribution in #1569
- @eitanturok made their first contribution in #1579
- @RylanSchaeffer made their first contribution in #1587
- @noufmitla made their first contribution in #1593
Full Changelog: v0.4.1...v0.4.2