lm-eval v0.4.3 Release Notes
We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.
New Additions
The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer, @clefourrier, @NathanHB, and also worked on by a number of other awesome contributors!

You can now run with a chat template applied using `--apply_chat_template`, and with a system prompt of your choosing using `--system_instruction "my sysprompt here"`. The `--fewshot_as_multiturn` flag controls whether each few-shot example in context is presented as a new conversational turn or not.

This feature is currently supported only for the `hf` and `vllm` model types, but we intend to gather feedback on improvements and also to extend it to other relevant model types, such as APIs.
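For example, a chat-templated run combining all three options might look like the following. This is a minimal sketch: the model and task names are placeholders rather than recommendations, and `--fewshot_as_multiturn` is only meaningful when `--apply_chat_template` is also passed.

```bash
# Hypothetical example: substitute your own chat model and task.
lm_eval --model hf \
    --model_args pretrained=HuggingFaceH4/zephyr-7b-beta \
    --tasks gsm8k \
    --num_fewshot 5 \
    --apply_chat_template \
    --system_instruction "You are a helpful assistant." \
    --fewshot_as_multiturn
```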
There's a lot more to check out, including:
- Logging results to the HF Hub if desired using `--hf_hub_log_args`, by @KonradSzafer and team! (See the example sketch just after this list.)
- NeMo model support by @sergiopperez!
- Anthropic Chat API support by @tryumanshow!
- DeepSparse and SparseML model types by @mgoin!
- Handling of delta-weights in HF models, by @KonradSzafer!
- LoRA support for vLLM, by @bcicc!
- Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld!
- Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong!
- The use of custom `Sampler` subclasses in tasks, by @LSinev!
- The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier!
- Support for Ascend NPUs (`--device npu`) by @statelesshz, @zhabuye, @jiaqiw09 and others!
- Logging of `higher_is_better` in results tables for a clearer understanding of eval metrics, by @zafstojano!
- Extra info logged about models, including info about tokenizers, chat templating, and more, by @artemorloff, @djstrong and others!
- Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
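As a sketch of the Hub-logging feature mentioned in the list above: `--hf_hub_log_args` accepts a comma-separated list of `key=value` pairs. The keys below follow the project documentation; the org and repo names are placeholders for your own.

```bash
# Hypothetical example: push results (and per-sample logs) to a HF Hub dataset repo.
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks lambada_openai \
    --log_samples \
    --output_path outputs \
    --hf_hub_log_args "hub_results_org=my-org,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False"
```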
New Tasks
We had a number of new tasks contributed. A listing of subfolders and a brief description of the tasks contained in each can now be found at `lm_eval/tasks/README.md`. We hope this makes it easier to locate relevant task definitions: first visit that page, then consult the README.md within the appropriate `lm_eval/tasks` subfolder for further information on each task that folder contains. Thank you to @anthonydipofi, @Harryalways317, @nairbv, @sepiatone and others for working on this and giving feedback!
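As a quick complement to the README, the full list of registered task names can also be printed directly from the command line:

```bash
lm_eval --tasks list
```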
Without further ado, the tasks:
- ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
- BasqueGlue and EusExams, two Basque-language tasks by @juletx
- TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
- XNLIeu, a Basque version of XNLI, by @juletx
- Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
- FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
- Added back the `hendrycks_math` task: the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper, rather than Minerva's prompt and parsing
- COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
- tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
- Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
- New FLD (formal logic) task variants by @MorishT
- Improved translations of Lambada Multilingual tasks, added by @zafstojano
- NoticIA, a Spanish summarization dataset by @ikergarcia1996
- The Paloma perplexity benchmark, added by @zafstojano
- We've removed the AMMLU dataset due to concerns about auto-translation quality.
- Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7!
- BertaQA, a Basque cultural knowledge benchmark, by @juletx
- New machine-translated ARC-C datasets by @jonabur !
- CommonsenseQA, in a prompt format following Llama, by @murphybrendan
- ...
Backwards Incompatibilities
The save format for logged results has now changed.

Output files will now be written to `{output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json` if `--output_path` is set, and to `{output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl` for each task's samples if `--log_samples` is set. For example: `outputs/gpt2/results_2024-06-28T00-00-00.00001.json` and `outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl`.

See #1926 for utilities which may help when working with these new filenames.
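Since the timestamps in these names sort lexicographically, a small shell sketch (assuming the `outputs/gpt2` layout from the example above) suffices to pick out the most recent results file:

```bash
# The newest results file sorts last among the timestamped names.
ls outputs/gpt2/results_*.json | tail -n 1
```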
Future Plans
In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!
- The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch, and subsequently in `v0.4.4` on PyPI!
- The fact that `group`s of tasks by default attempt to report an aggregated score across constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between `group`s of tasks that do report aggregate scores (think `mmlu`) and `tag`s, which are simply a convenient shortcut for calling a set of tasks one might want to run at once (think the `pythia` grouping, which merely represents a collection of tasks one might want to gather results on all at once, but where averaging across them doesn't make sense).
- We'd also like to improve the API model support in the Eval Harness from its current state.
- More to come!
Thank you to everyone who's contributed to or used the library!
Thanks, @haileyschoelkopf @lintangsutawika
What's Changed
- use BOS token in loglikelihood by @djstrong in #1588
- Revert "Patch for Seq2Seq Model predictions" by @haileyschoelkopf in #1601
- fix gen_kwargs arg reading by @artemorloff in #1607
- fix until arg processing by @artemorloff in #1608
- Fixes to Loglikelihood prefix token / VLLM by @haileyschoelkopf in #1611
- Add ACLUE task by @haonan-li in #1614
- OpenAI Completions -- fix passing of unexpected 'until' arg by @haileyschoelkopf in #1612
- add logging of model args by @baberabb in #1619
- Add vLLM FAQs to README (#1625) by @haileyschoelkopf in #1633
- peft Version Assertion by @LameloBally in #1635
- Seq2seq fix by @lintangsutawika in #1604
- Integration of NeMo models into LM Evaluation Harness library by @sergiopperez in #1598
- Fix conditional import for Nemo LM class by @haileyschoelkopf in #1641
- Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring by @orsharir in #1647
- Add Latxa paper evaluation tasks for Basque by @juletx in #1654
- Fix CLI --batch_size arg for openai-completions/local-completions by @mgoin in #1656
- Patch QQP prompt (#1648) by @haileyschoelkopf in #1661
- TMMLU+ implementation by @ZoneTwelve in #1394
- Anthropic Chat API by @tryumanshow in #1594
- correction bug #1664 by @nicho2 in #1670
- Signpost potential bugs / unsupported ops in MPS backend by @haileyschoelkopf in #1680
- Add delta weights model loading by @KonradSzafer in #1712
- Add `neuralmagic` models for `sparseml` and `deepsparse` by @mgoin in #1674
- Improvements to run NVIDIA NeMo models on LM Evaluation Harness by @sergiopperez in #1699
- Adding retries and rate limit to toxicity tasks by @sator-labs in #1620
- reference `--tasks list` in README by @nairbv in #1726
- Add XNLIeu: a dataset for cross-lingual NLI in Basque by @juletx in #1694
- Fix Parameter Propagation for Tasks that have `include` by @lintangsutawika in #1749
- Support individual scrolls datasets by @giorgossideris in #1740
- Add filter registry decorator by @lozhn in #1750
- remove duplicated `num_fewshot: 0` by @chujiezheng in #1769
- Pile 10k new task by @mukobi in #1758
- Fix m_arc choices by @jordane95 in #1760
- upload new tasks by @simran-arora in #1728
- vllm lora support by @bcicc in #1756
- Add option to set OpenVINO config by @helena-intel in #1730
- evaluation tracker implementation by @KonradSzafer in #1766
- eval tracker args fix by @KonradSzafer in #1777
- limit fix by @KonradSzafer in #1785
- remove echo parameter in OpenAI completions API by @djstrong in #1779
- Fix README: change `----hf_hub_log_args` to `--hf_hub_log_args` by @MuhammadBinUsman03 in #1776
- Fix bug in setting until kwarg in openai completions by @ciaranby in #1784
- Provide ability for custom sampler for ConfigurableTask by @LSinev in #1616
- Update `--tasks list` option in interface documentation by @sepiatone in #1792
- Fix Caching Tests; Remove `pretrained=gpt2` default by @haileyschoelkopf in #1775
- link to the example output on the hub by @KonradSzafer in #1798
- Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant by @haileyschoelkopf in #1793
- Logging Updates (Alphabetize table printouts, fix eval tracker bug) (#1774) by @haileyschoelkopf in #1791
- Initial integration of the Unitxt to LM eval harness by @yoavkatz in #1615
- add task for mmlu evaluation in arc multiple choice format by @jonabur in #1745
- Update flag `--hf_hub_log_args` in interface documentation by @sepiatone in #1806
- Copal task by @Erland366 in #1803
- Adding tinyBenchmarks datasets by @LucWeber in #1545
- interface doc update by @KonradSzafer in #1807
- Fix links in README guiding to another branch by @LSinev in #1838
- Fix: support PEFT/LoRA with added tokens by @mapmeld in #1828
- Fix incorrect check for task type by @zafstojano in #1865
- Fixing typos in `docs` by @zafstojano in #1863
- Update polemo2_out.yaml by @zhabuye in #1871
- Unpin vllm in dependencies by @edgan8 in #1874
- Fix outdated links to the latest links in `docs` by @oneonlee in #1876
- [HFLM] Use Accelerate's API to reduce hard-coded CUDA code by @statelesshz in #1880
- Fix `batch_size=auto` for HF Seq2Seq models (#1765) by @haileyschoelkopf in #1790
- Fix Brier Score by @lintangsutawika in #1847
- Fix for bootstrap_iters = 0 case (#1715) by @haileyschoelkopf in #1789
- add mmlu tasks from pile-t5 by @lintangsutawika in #1710
- Bigbench fix by @lintangsutawika in #1686
- Rename `lm_eval.logging` -> `lm_eval.loggers` by @haileyschoelkopf in #1858
- Updated vllm imports in vllm_causallms.py by @mgoin in #1890
- [HFLM] Add support for Ascend NPU by @statelesshz in #1886
- `higher_is_better` tickers in output table by @zafstojano in #1893
- Add dataset card when pushing to HF hub by @KonradSzafer in #1898
- Making hardcoded few shots compatible with the chat template mechanism by @clefourrier in #1895
- Try to make existing tests run little bit faster by @LSinev in #1905
- Fix fewshot seed only set when overriding num_fewshot by @LSinev in #1914
- Complete task list from pr 1727 by @anthony-dipofi in #1901
- Add chat template by @KonradSzafer in #1873
- Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data by @maximegmd in #1867
- Modify pre-commit hook to check merge conflicts accidentally committed by @LSinev in #1927
- [add] fld logical formula task by @MorishT in #1931
- Add new Lambada translations by @zafstojano in #1897
- Implement NoticIA by @ikergarcia1996 in #1912
- Add The Arabic version of the PICA benchmark by @khalil-Hennara in #1917
- Fix social_iqa answer choices by @haileyschoelkopf in #1909
- Update basque-glue by @zhabuye in #1913
- Test output table layout consistency by @zafstojano in #1916
- Fix a tiny typo in `__main__.py` by @sadra-barikbin in #1939
- Add the Arabic version with refactor to Arabic pica to be in alghafa … by @khalil-Hennara in #1940
- Results filenames handling fix by @KonradSzafer in #1926
- Remove AMMLU Due to Translation by @haileyschoelkopf in #1948
- Add option in TaskManager to not index library default tasks ; Tests for include_path by @haileyschoelkopf in #1856
- Force BOS token usage in 'gemma' models for VLLM by @haileyschoelkopf in #1857
- Fix a tiny typo in `docs/interface.md` by @sadra-barikbin in #1955
- Fix self.max_tokens in anthropic_llms.py by @lozhn in #1848
- `samples` is newline delimited by @baberabb in #1930
- Fix `--gen_kwargs` and VLLM (`temperature` not respected) by @haileyschoelkopf in #1800
- Make `scripts.write_out` error out when no splits match by @haileyschoelkopf in #1796
- fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' by @johnwee1 in #1956
- add trust_remote_code for piqa by @changwangss in #1983
- Fix self assignment in neuron_optimum.py by @LSinev in #1990
- [New Task] Add Paloma benchmark by @zafstojano in #1928
- Fix Paloma Template yaml by @haileyschoelkopf in #1993
- Log `fewshot_as_multiturn` in results files by @haileyschoelkopf in #1995
- Added ArabicMMLU by @Yazeed7 in #1987
- Fix Datasets `--trust_remote_code` by @haileyschoelkopf in #1998
- Add BertaQA dataset tasks by @juletx in #1964
- add tokenizer logs info by @artemorloff in #1731
- Hotfix breaking import by @StellaAthena in #2015
- add arc_challenge_mt by @jonabur in #1900
- Remove `LM` dependency from `build_all_requests` by @baberabb in #2011
- Added CommonsenseQA task by @murphybrendan in #1721
- Factor out LM-specific tests by @haileyschoelkopf in #1859
- Update interface.md by @johnwee1 in #1982
- Fix `trust_remote_code`-related test failures by @haileyschoelkopf in #2024
- Fixes scrolls task bug with few_shot examples by @xksteven in #2003
- fix cache by @baberabb in #2037
- Add chat template to `vllm` by @baberabb in #2034
- Fail gracefully upon tokenizer logging failure (#2035) by @haileyschoelkopf in #2038
- Bundle `exact_match` HF Evaluate metric with install, don't call evaluate.load() on import by @haileyschoelkopf in #2045
- Update package version to v0.4.3 by @haileyschoelkopf in #2046
New Contributors
- @LameloBally made their first contribution in #1635
- @sergiopperez made their first contribution in #1598
- @orsharir made their first contribution in #1647
- @ZoneTwelve made their first contribution in #1394
- @tryumanshow made their first contribution in #1594
- @nicho2 made their first contribution in #1670
- @KonradSzafer made their first contribution in #1712
- @sator-labs made their first contribution in #1620
- @giorgossideris made their first contribution in #1740
- @lozhn made their first contribution in #1750
- @chujiezheng made their first contribution in #1769
- @mukobi made their first contribution in #1758
- @simran-arora made their first contribution in #1728
- @bcicc made their first contribution in #1756
- @helena-intel made their first contribution in #1730
- @MuhammadBinUsman03 made their first contribution in #1776
- @ciaranby made their first contribution in #1784
- @sepiatone made their first contribution in #1792
- @yoavkatz made their first contribution in #1615
- @Erland366 made their first contribution in #1803
- @LucWeber made their first contribution in #1545
- @mapmeld made their first contribution in #1828
- @zafstojano made their first contribution in #1865
- @zhabuye made their first contribution in #1871
- @edgan8 made their first contribution in #1874
- @oneonlee made their first contribution in #1876
- @statelesshz made their first contribution in #1880
- @clefourrier made their first contribution in #1895
- @maximegmd made their first contribution in #1867
- @ikergarcia1996 made their first contribution in #1912
- @sadra-barikbin made their first contribution in #1939
- @johnwee1 made their first contribution in #1956
- @changwangss made their first contribution in #1983
- @Yazeed7 made their first contribution in #1987
- @murphybrendan made their first contribution in #1721
- @xksteven made their first contribution in #2003
Full Changelog: v0.4.2...v0.4.3