
v0.4.3

@haileyschoelkopf released this 01 Jul 14:00

lm-eval v0.4.3 Release Notes

We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.

New Additions

The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer, @clefourrier, and @NathanHB, and also worked on by a number of other awesome contributors!

You can now evaluate with a model's chat template applied using --apply_chat_template, and supply a system prompt of your choosing using --system_instruction "my sysprompt here". The --fewshot_as_multiturn flag controls whether each few-shot example in context is presented as its own conversational turn.

This feature is currently supported only for the hf and vllm model types, but we intend to gather feedback on improvements and to extend it to other relevant model types, such as API-based models.
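
For example, a single run combining these new flags might look like the sketch below; the model and task names are placeholders, while the flags themselves are as described above.

```bash
# Sketch: evaluate an HF model with its chat template applied, a custom
# system prompt, and each few-shot example rendered as its own turn.
# <your-model> and <your-task> are placeholders.
lm_eval --model hf \
    --model_args pretrained=<your-model> \
    --tasks <your-task> \
    --num_fewshot 5 \
    --apply_chat_template \
    --system_instruction "my sysprompt here" \
    --fewshot_as_multiturn
```

As we understand it, --fewshot_as_multiturn is only meaningful in combination with --apply_chat_template.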

There's a lot more to check out, including:

  • Logging results to the HF Hub if desired using --hf_hub_log_args, by @KonradSzafer and team! (See the sketch after this list.)

  • NeMo model support by @sergiopperez !

  • Anthropic Chat API support by @tryuman !

  • DeepSparse and SparseML model types by @mgoin !

  • Handling of delta-weights in HF models, by @KonradSzafer !

  • LoRA support for vLLM, by @bcicc !

  • Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld !

  • Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong !

  • The use of custom Sampler subclasses in tasks, by @LSinev !

  • The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier !

  • Support for Ascend NPUs (--device npu) by @statelesshz, @zhabuye, @jiaqiw09 and others!

  • Logging of higher_is_better in results tables for clearer understanding of eval metrics by @zafstojano !

  • Extra info logged about models, including info about tokenizers, chat templating, and more, by @artemorloff, @djstrong, and others!

  • Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
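
As mentioned in the first bullet above, results can be pushed to the Hugging Face Hub via --hf_hub_log_args. A minimal sketch follows; the flag itself is from this release, but the specific key=value option names shown are assumptions for illustration only, so please check the documentation for the exact keys.

```bash
# Sketch only: push eval results to an HF Hub repo after the run.
# The key=value names below are assumed for illustration; see the docs
# for the exact options accepted by --hf_hub_log_args.
lm_eval --model hf \
    --model_args pretrained=<your-model> \
    --tasks <your-task> \
    --output_path results \
    --hf_hub_log_args "hub_results_org=<your-org>,push_results_to_hub=True"
```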

New Tasks

We had a number of new tasks contributed. A listing of subfolders, with a brief description of the tasks each contains, can now be found at lm_eval/tasks/README.md. We hope this makes it easier to locate the definitions of relevant tasks: start at that page, then consult the README.md in the appropriate lm_eval/tasks subfolder for further info on each task it contains. Thank you to @anthonydipofi, @Harryalways317, @nairbv, @sepiatone, and others for working on this and giving feedback!
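
Relatedly, if you prefer to browse from the command line rather than the repository, the harness can print the registered task names (a small sketch; exact output varies by version):

```bash
# Print the names of all tasks currently registered with the harness.
lm_eval --tasks list

# Then consult the matching subfolder README for details, e.g.:
cat lm_eval/tasks/README.md
```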

Without further ado, the tasks:

  • ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
  • BasqueGlue and EusExams, two Basque-language tasks by @juletx
  • TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
  • XNLIeu, a Basque version of XNLI, by @juletx
  • Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
  • FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
  • Added back the hendrycks_math task: the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper, rather than Minerva's prompt and parsing
  • COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
  • tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
  • Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
  • New FLD (formal logic) task variants by @MorishT
  • Improved translations of Lambada Multilingual tasks, added by @zafstojano
  • NoticIA, a Spanish summarization dataset by @ikergarcia1996
  • The Paloma perplexity benchmark, added by @zafstojano
  • We've removed the AMMLU dataset due to concerns about auto-translation quality.
  • Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !
  • BertaQA, a Basque cultural knowledge benchmark, by @juletx
  • New machine-translated ARC-C datasets by @jonabur !
  • CommonsenseQA, in a prompt format following Llama, by @murphybrendan
  • ...

Backwards Incompatibilities

The save format for logged results has now changed.

Output files will now be written to:

  • {output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json if --output_path is set, and
  • {output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl for each task's samples if --log_samples is set.

For example: outputs/gpt2/results_2024-06-28T00-00-00.00001.json and outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl.
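
As a concrete sketch, a run like the following (using the gpt2 / lambada_openai example above) would produce files matching those paths; the sanitized directory name is derived from the model arguments.

```bash
# With --output_path and --log_samples set, results and per-task samples are
# written under {output_path}/{sanitized_model_name}/, as in the example above.
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks lambada_openai \
    --output_path outputs \
    --log_samples
```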

See #1926 for utilities which may help to work with these new filenames.

Future Plans

In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!

  • The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch and subsequently in v0.4.4 on PyPI!

  • The fact that groups of tasks by default attempt to report an aggregated score across their constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between groups, which do report aggregate scores (think mmlu), and tags, which are simply a convenient shortcut for calling a set of tasks one might want to run at once (think the pythia grouping, which is merely a collection of tasks to gather results on together, and for which averaging doesn't make sense).

  • We'd also like to improve the API model support in the Eval Harness from its current state.

  • More to come!

Thank you to everyone who's contributed to or used the library!

Thanks, @haileyschoelkopf @lintangsutawika

What's Changed

New Contributors

Full Changelog: v0.4.2...v0.4.3