
v0.4.3

@haileyschoelkopf released this 01 Jul 14:00

lm-eval v0.4.3 Release Notes

We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.

New Additions

The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer, @clefourrier, and @NathanHB, and also worked on by a number of other awesome contributors!

You can now evaluate with a model's chat template applied using --apply_chat_template, and supply a system prompt of your choosing using --system_instruction "my sysprompt here". The --fewshot_as_multiturn flag controls whether each few-shot example in context is presented as its own conversational turn.

This feature is currently supported only for the hf and vllm model types, but we intend to gather feedback on improvements and to extend it to other relevant model types, such as API-based models.
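
For example, a single run combining these new flags might look like the sketch below; the model and task names are placeholders, while the flags themselves are as described above.

```bash
# Sketch: evaluate an HF model with its chat template applied, a custom
# system prompt, and each few-shot example rendered as its own turn.
# <your-model> and <your-task> are placeholders.
lm_eval --model hf \
    --model_args pretrained=<your-model> \
    --tasks <your-task> \
    --num_fewshot 5 \
    --apply_chat_template \
    --system_instruction "my sysprompt here" \
    --fewshot_as_multiturn
```

As we understand it, --fewshot_as_multiturn is only meaningful in combination with --apply_chat_template.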

There's a lot more to check out, including:

  • Logging results to the HF Hub if desired using --hf_hub_log_args, by @KonradSzafer and team! (See the sketch after this list.)

  • NeMo model support by @sergiopperez !

  • Anthropic Chat API support by @tryuman !

  • DeepSparse and SparseML model types by @mgoin !

  • Handling of delta-weights in HF models, by @KonradSzafer !

  • LoRA support for vLLM, by @bcicc !

  • Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld !

  • Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong !

  • The use of custom Sampler subclasses in tasks, by @LSinev !

  • The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier !

  • Support for Ascend NPUs (--device npu) by @statelesshz, @zhabuye, @jiaqiw09 and others!

  • Logging of higher_is_better in results tables for clearer understanding of eval metrics by @zafstojano !

  • Extra info logged about models, including info about tokenizers, chat templating, and more, by @artemorloff, @djstrong, and others!

  • Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
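
As mentioned in the first bullet above, results can be pushed to the Hugging Face Hub via --hf_hub_log_args. A minimal sketch follows; the flag itself is from this release, but the specific key=value option names shown are assumptions for illustration only, so please check the documentation for the exact keys.

```bash
# Sketch only: push eval results to an HF Hub repo after the run.
# The key=value names below are assumed for illustration; see the docs
# for the exact options accepted by --hf_hub_log_args.
lm_eval --model hf \
    --model_args pretrained=<your-model> \
    --tasks <your-task> \
    --output_path results \
    --hf_hub_log_args "hub_results_org=<your-org>,push_results_to_hub=True"
```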

New Tasks

We had a number of new tasks contributed. A listing of subfolders, with a brief description of the tasks each contains, can now be found at lm_eval/tasks/README.md. We hope this makes it easier to locate the definitions of relevant tasks: start at that page, then consult the README.md in the appropriate lm_eval/tasks subfolder for further info on each task it contains. Thank you to @anthonydipofi, @Harryalways317, @nairbv, @sepiatone, and others for working on this and giving feedback!
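
Relatedly, if you prefer to browse from the command line rather than the repository, the harness can print the registered task names (a small sketch; exact output varies by version):

```bash
# Print the names of all tasks currently registered with the harness.
lm_eval --tasks list

# Then consult the matching subfolder README for details, e.g.:
cat lm_eval/tasks/README.md
```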

Without further ado, the tasks:

  • ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
  • BasqueGlue and EusExams, two Basque-language tasks by @juletx
  • TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
  • XNLIeu, a Basque version of XNLI, by @juletx
  • Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
  • FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
  • Added back the hendrycks_math task: the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper, rather than Minerva's prompt and parsing
  • COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
  • tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
  • Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
  • New FLD (formal logic) task variants by @MorishT
  • Improved translations of Lambada Multilingual tasks, added by @zafstojano
  • NoticIA, a Spanish summarization dataset by @ikergarcia1996
  • The Paloma perplexity benchmark, added by @zafstojano
  • We've removed the AMMLU dataset due to concerns about auto-translation quality.
  • Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !
  • BertaQA, a Basque cultural knowledge benchmark, by @juletx
  • New machine-translated ARC-C datasets by @jonabur !
  • CommonsenseQA, in a prompt format following Llama, by @murphybrendan
  • ...

Backwards Incompatibilities

The save format for logged results has now changed.

Output files will now be written to:

  • {output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json if --output_path is set, and
  • {output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl for each task's samples if --log_samples is set.

For example: outputs/gpt2/results_2024-06-28T00-00-00.00001.json and outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl.
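
As a concrete sketch, a run like the following (using the gpt2 / lambada_openai example above) would produce files matching those paths; the sanitized directory name is derived from the model arguments.

```bash
# With --output_path and --log_samples set, results and per-task samples are
# written under {output_path}/{sanitized_model_name}/, as in the example above.
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks lambada_openai \
    --output_path outputs \
    --log_samples
```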

See #1926 for utilities which may help to work with these new filenames.

Future Plans

In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!

  • The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch and subsequently in v0.4.4 on PyPI!

  • The fact that groups of tasks by default attempt to report an aggregated score across their constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between groups, which do report aggregate scores (think mmlu), and tags, which are simply a convenient shortcut for calling a set of tasks one might want to run at once (think the pythia grouping, which is merely a collection of tasks to gather results on together, and for which averaging doesn't make sense).

  • We'd also like to improve the API model support in the Eval Harness from its current state.

  • More to come!

Thank you to everyone who's contributed to or used the library!

Thanks, @haileyschoelkopf @lintangsutawika

What's Changed

New Contributors

Full Changelog: v0.4.2...v0.4.3