The aim of this compendium is to help academics and industry professionals create effective evaluation suites tailored to their specific needs. It reviews the top industry practices for assessing large language models (LLMs) and their applications. The work goes beyond cataloging benchmarks and evaluation studies: it offers a comprehensive overview of effective and practical evaluation techniques, including those embedded in papers whose primary focus is introducing new LLM methodologies and tasks. I plan to update this survey periodically with noteworthy and shareable evaluation methods that I come across. The goal is a resource where anyone with a question (how to evaluate an LLM or an LLM application for a specific task, which methods best assess LLM effectiveness, or how well an LLM performs in a particular domain) can easily find the relevant information. Additionally, I want to highlight methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align well with business or academic objectives.
My view on LLM Evaluation: Deck; SF Big Analytics and AICamp video; Analytics Vidhya (Data Phoenix, Mar 5) (by Andrei Lopatenko)
- Reviews and Surveys
- Leaderboards and Arenas
- Evaluation Software
- LLM Evaluation articles in tech media and blog posts from companies
- Large benchmarks
- Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation
- Long Comprehensive Studies
- HITL (Human in the Loop)
- LLM as Judge
- LLM Evaluation
- Embeddings
- In Context Learning
- Hallucinations
- Question Answering
- Multi Turn
- Reasoning
- Multi-Lingual
- Multi-Modal
- Instruction Following
- Ethical AI
- Biases
- Safe AI
- Cybersecurity
- Code Generating LLMs
- Summarization
- LLM quality (generic methods: overfitting, redundant layers, etc.)
- Software Performance (latency, throughput, memory, storage)
- Agent LLM architectures
- Long Text Generation
- Graph Understanding
- Various unclassified tasks
- LLM Systems
- Other collections
- Citation
- Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv:
- A Survey on Evaluation of Large Language Models, Jul 2023, arxiv:
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv:
- New Hard Leaderboard by HuggingFace leaderboard description, blog post
- LMSys Arena (explanation:)
- Salesforce's Contextual Bench leaderboard at Hugging Face, an overview of how different LLMs perform across a variety of contextual tasks
- OpenGPT-X Multi-Lingual European LLM Leaderboard, evaluation of LLMs for many European languages, on HuggingFace
- OpenLLM Leaderboard
- MTEB
- SWE Bench
- AlpacaEval leaderboard Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv code
- Open Medical LLM Leaderboard from HF Explanation
- Gorilla, Berkeley function calling Leaderboard Explanation
- WildBench WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- Enterprise Scenarios, Patronus
- Vectara Hallucination Leaderboard
- Ray/Anyscale's LLM Performance Leaderboard (explanation:)
- Hugging Face LLM Performance leaderboard
- Multi-task Language Understanding on MMLU
- EleutherAI LLM Evaluation Harness
- Eureka, Microsoft, A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. github Sep 2024 arxiv
- OpenAI Evals
- ConfidentAI DeepEval
- MTEB
- OpenICL Framework
- RAGAS
- ML Flow Evaluate
- MosaicML Composer
- Toolkit from Mozilla AI for LLM-as-judge evaluation: lm-buddy eval tool, model: Prometheus
- TruLens
- Promptfoo
- BigCode Evaluation Harness
- LangFuse
- LLMeBench see LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
- ChainForge
- Ironclad Rivet
- LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models, arxiv pdf github repository
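
The tools above differ in task libraries, integrations, and reporting, but most reduce to the same core loop: run prompts through a model, score the outputs against references or a judge, and aggregate the scores. A minimal sketch of that loop in plain Python (the `generate_fn` callable and the toy task are hypothetical placeholders, not the API of any framework listed here):

```python
from typing import Callable, Iterable

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return float(norm(prediction) == norm(reference))

def run_eval(generate_fn: Callable[[str], str],
             task: Iterable[tuple[str, str]]) -> dict:
    """Run every (prompt, reference) pair through the model and aggregate a score."""
    scores = [exact_match(generate_fn(prompt), reference) for prompt, reference in task]
    return {"n": len(scores), "exact_match": sum(scores) / max(len(scores), 1)}

if __name__ == "__main__":
    # Hypothetical toy task; real harnesses load tasks such as MMLU or HellaSwag.
    toy_task = [("What is the capital of France?", "Paris"),
                ("2 + 2 =", "4")]
    # Stand-in model; replace with a call into your LLM or an API client.
    toy_model = lambda prompt: "Paris" if "France" in prompt else "4"
    print(run_eval(toy_model, toy_task))
```

Real harnesses add task registries, few-shot prompt construction, batching, caching, and richer metrics on top of this skeleton.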
---
- Evaluate LLMs using Evaluation Harness and Hugging Face TGI/vLLM, Sep 2024, blog
- Let's talk about LLM Evaluation, HuggingFace, article
- Using LLMs for Evaluation: LLM-as-a-Judge and other scalable additions to human quality ratings, Aug 2024, Deep Learning Focus
- Introducing SimpleQA, OpenAI, Oct 2024 OpenAI
- Catch me if you can! How to beat GPT-4 with a 13B model, LM sys org
- Why it’s impossible to review AIs, and why TechCrunch is doing it anyway, TechCrunch, Mar 2024
- A.I. has a measurement problem, NY Times, Apr 2024
- Beyond Accuracy: The Changing Landscape Of AI Evaluation, Forbes, Mar 2024
- Mozilla AI Exploring LLM Evaluation at scale
- Evaluation part of How to Maximize LLM Performance
- The Mozilla AI blog has published multiple good articles on LLM evaluation
- Andrej Karpathy on evaluation X
- From Meta on evaluation of Llama 3 models github
- DeepMind AI Safety evaluation, June 2024, DeepMind blog: Introducing the Frontier Safety Framework
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- Hamel Dev, March 2024, Your AI Product Needs Evals: How to construct domain-specific LLM evaluation systems
- MMLU Pro: Massive Multitask Language Understanding - Pro version, Jun 2024, arxiv
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, EMNLP 2022, pdf
- Measuring Massive Multitask Language Understanding, MMLU, ICLR 2021, arxiv, MMLU dataset
- BigBench: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, arxiv, datasets
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Oct 2022, arxiv
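
Many of the benchmarks above (MMLU, parts of BIG-bench) are multiple-choice and are commonly scored by comparing the model's log-likelihood of each answer option rather than parsing generated text. A minimal sketch of that scoring rule, assuming a hypothetical `option_logprob(question, option)` callable provided by your model wrapper:

```python
from typing import Callable, Sequence

def pick_option(question: str,
                options: Sequence[str],
                option_logprob: Callable[[str, str], float]) -> int:
    """Return the index of the option to which the model assigns the highest log-likelihood."""
    scores = [option_logprob(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

def accuracy(dataset, option_logprob) -> float:
    """dataset: iterable of (question, options, correct_index) triples."""
    hits = [pick_option(q, opts, option_logprob) == gold for q, opts, gold in dataset]
    return sum(hits) / max(len(hits), 1)

if __name__ == "__main__":
    # Toy illustration with a fake scorer that simply prefers shorter options.
    fake_logprob = lambda question, option: -float(len(option))
    demo = [("Largest planet?", ["Jupiter", "Earth is the largest"], 0)]
    print("accuracy:", accuracy(demo, fake_logprob))
```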
- Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks, May 2024, ICML 2024, arxiv
- A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations, EMNLP 2024, ACLAnthology
- Lessons from the Trenches on Reproducible Evaluation of Language Models, May 2024, arxiv
- Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat, Nov 2024, arxiv
- Sabotage Evaluations for Frontier Models, Anthropic, Nov 2024, paper blog post
- AI Benchmarks and Datasets for LLM Evaluation, Dec 2024, arxiv, a survey of many LLM benchmarks
- Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks, Aug 2024, ACL 2024
- Synthetic data in evaluation, see Chapter 3 in Best Practices and Lessons Learned on Synthetic Data for Language Models, Apr 2024, arxiv
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 arxiv
- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, arxiv
- Are Emergent Abilities of Large Language Models a Mirage?, Apr 2023, arxiv
- Don't Make Your LLM an Evaluation Benchmark Cheater, Nov 2023, arxiv
- Evaluating Question Answering Evaluation, 2019, ACL
- Evaluating Open-QA Evaluation, 2023, arxiv
- (Re: statistical methods) Prediction-Powered Inference, Jan 2023, arxiv; PPI++: Efficient Prediction-Powered Inference, Nov 2023, arxiv
- Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress, Feb 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Detecting Pretraining Data from Large Language Models, Oct 2023, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- Faithful model evaluation for model-based metrics, EMNLP 2023, amazon science
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models, ICML 2023, mlr press
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- State of What Art? A Call for Multi-Prompt LLM Evaluation, Aug 2024, Transactions of the Association for Computational Linguistics (2024) 12
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- Evaluation of OpenAI o1: Opportunities and Challenges of AGI, Sep 2024, arxiv
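
Several of the papers above (for example "Elo Uncovered" and "Ranking Unraveled") examine how arena-style leaderboards turn pairwise votes into ratings and how sensitive those ratings are to the battle sample. A minimal sketch of the standard online Elo update plus a bootstrap over resampled battles, one simple way to expose that sensitivity (the toy battle data and K-factor are illustrative assumptions):

```python
import random
from collections import defaultdict

def elo_ratings(battles, k=4.0, base=1000.0):
    """battles: list of (model_a, model_b, winner) with winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in battles:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if winner == "a" else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

def bootstrap_elo(battles, rounds=200, seed=0):
    """Resample battles with replacement to get a (min, mean, max) rating range per model."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = [battles[rng.randrange(len(battles))] for _ in battles]
        for model, rating in elo_ratings(resampled).items():
            samples[model].append(rating)
    return {m: (min(r), sum(r) / len(r), max(r)) for m, r in samples.items()}

if __name__ == "__main__":
    toy_battles = [("m1", "m2", "a")] * 60 + [("m1", "m2", "b")] * 40
    print(bootstrap_elo(toy_battles))
```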
- Evaluating Question Answering Evaluation, 2019, ACL
- Developing a Framework for Auditing Large Language Models Using Human-in-the-Loop, Feb 2024, arxiv
- Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation, Nov 2023, arxiv
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, Dec 2024, arxiv
- Large Language Models are Inconsistent and Biased Evaluators, May 2024, arxiv
- Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation, Oct 2024, arxiv
- Evaluating LLMs at Detecting Errors in LLM Responses, Apr 2024, arxiv
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models, Apr 2024, arxiv
- Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries, Sep 2024, arxiv
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Jun 2023, arxiv
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, Jun 2024, arxiv
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv leaderboard code
- Discovering Language Model Behaviors with Model-Written Evaluations, Dec 2022, arxiv
- The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate, Feb 2024, arxiv
- Benchmarking Foundation Models with Language-Model-as-an-Examiner, 2022, NEURIPS
- Red Teaming Language Models with Language Models, Feb 2022, arxiv
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, Aug 2023, arxiv
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning, Sep 2023, arxiv
- Style Over Substance: Evaluation Biases for Large Language Models, Jul 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality, Feb 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Code Generation, Apr 2023, researchgate
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators, Mar 2024, arxiv
- LLM Evaluators Recognize and Favor Their Own Generations, Apr 2024, pdf
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences, Apr 2024, arxiv
- Using LLMs for Evaluation LLM-as-a-Judge and other scalable additions to human quality ratings. Aug 2024, Deep Learning Focus
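
A recurring finding in the judge papers above is position bias: an LLM judge often favors whichever answer it sees first. A common mitigation is to query the judge twice with the answer order swapped and keep only consistent verdicts, as sketched below (the `judge_fn` callable returning "A" or "B" is a hypothetical stand-in for your judge model, and the prompt template is illustrative):

```python
from typing import Callable, Optional

JUDGE_TEMPLATE = (
    "You are a strict evaluator. Question:\n{question}\n\n"
    "Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
    "Reply with exactly one character, A or B, for the better answer."
)

def pairwise_verdict(question: str, answer_1: str, answer_2: str,
                     judge_fn: Callable[[str], str]) -> Optional[int]:
    """Return 1 if answer_1 wins, 2 if answer_2 wins, None if the judge is inconsistent."""
    first = judge_fn(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2)).strip().upper()
    second = judge_fn(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1)).strip().upper()
    if first == "A" and second == "B":
        return 1
    if first == "B" and second == "A":
        return 2
    return None  # position-dependent verdict: discard or escalate to a human rater
```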
- MTEB: Massive Text Embedding Benchmark, Oct 2022, [arxiv](https://arxiv.org/abs/2210.07316), Leaderboard
- Marqo embedding benchmark for eCommerce at Huggingface, text to image and category to image tasks
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- MMTEB: Community driven extension to MTEB repository
- Chinese MTEB C-MTEB repository
- French MTEB repository
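
MTEB-style retrieval tasks reduce to embedding queries and documents, ranking by cosine similarity, and reporting metrics such as recall@k or nDCG@k. A minimal recall@k sketch with numpy, assuming you have already produced the embedding matrices with the model under test:

```python
import numpy as np

def recall_at_k(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                relevant: list[set[int]], k: int = 10) -> float:
    """query_vecs: (Q, d); doc_vecs: (N, d); relevant[i]: indices of docs relevant to query i."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                            # cosine similarity matrix, shape (Q, N)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the top-k documents per query
    hits = [len(set(top_k[i]) & relevant[i]) / max(len(relevant[i]), 1)
            for i in range(len(relevant))]
    return float(np.mean(hits))
```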
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- The LAMBADA dataset: Word prediction requiring a broad discourse context 2016, arxiv
- TrustLLM: Trustworthiness in Large Language Models: A Principle and Benchmark, Jan 2024, arxiv
- INVITE: A testbed of automatically generated invalid questions to evaluate large language models for hallucinations, EMNLP 2023, amazon science
- A Survey of Hallucination in Large Visual Language Models, Oct 2024, See Chapter IV, Evaluation of Hallucinations arxiv
- Generating Benchmarks for Factuality Evaluation of Language Models, Jul 2023, arxiv
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, Dec 2023, ACL
- Long-form factuality in large language models, Mar 2024, arxiv
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Sep 2023, arxiv
- Measuring Faithfulness in Chain-of-Thought Reasoning, Jul 2023, arxiv
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, May 2023, arxiv repository
- Introducing SimpleQA, OpenAI, Oct 2024 OpenAI
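
Several entries above (FActScore, the long-form factuality work) score hallucination by splitting a generation into atomic claims and checking each claim against a knowledge source. A heavily simplified sketch of that pipeline; the `extract_claims` and `is_supported` callables are hypothetical components you would back with an LLM, a retriever, or human annotation:

```python
from typing import Callable

def factual_precision(generation: str,
                      extract_claims: Callable[[str], list[str]],
                      is_supported: Callable[[str], bool]) -> float:
    """Fraction of extracted atomic claims that the verifier marks as supported."""
    claims = extract_claims(generation)
    if not claims:
        return 1.0  # nothing checkable; treated as trivially factual (a design choice)
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)

# Naive placeholder claim splitter: one "claim" per sentence.
naive_split = lambda text: [s.strip() for s in text.split(".") if s.strip()]
```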
QA is used in many vertical domains; see the Vertical section below.
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv Answer Engine (RAG) Evaluation Repository
- Introducing SimpleQA, OpenAI, Oct 2024 OpenAI
- Are Large Language Models Consistent over Value-laden Questions?, Jul 2024, arxiv
- CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, Jun 2019, ACL
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, Sep 2018, arxiv OpenBookQA dataset at AllenAI
- Jin, Di, et al., "What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams", 2020, arxiv, MedQA
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018, arxiv ARC Easy dataset ARC dataset
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019, arxiv BoolQ dataset
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- PIQA: Reasoning about Physical Commonsense in Natural Language, Nov 2019, arxiv PIQA dataset
- Crowdsourcing Multiple Choice Science Questions arxiv SciQ dataset
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019, arxiv, Winogrande dataset
- TruthfulQA: Measuring How Models Mimic Human Falsehoods, Sep 2021, arxiv
- TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, arxiv data
- Natural Questions: A Benchmark for Question Answering Research, Transactions ACL 2019
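
Short-answer QA benchmarks like several of those above are conventionally scored with normalized exact match and token-level F1 (the SQuAD convention). A minimal sketch of both metrics:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```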
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models Nov 2023, arxiv
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues Feb 24 arxiv
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis Feb 2024 arxiv
- Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models, Oct 2023, arxiv
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS 2023, NeurIPS
- MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback, Sep 2023, arxiv
- FrontierMath at EpochAI, FrontierAI page, FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, Nov 2024, arxiv
- Easy Problems That LLMs Get Wrong, May 2024, arxiv, a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning, NeurIPS 2024 Track Datasets and Benchmarks Spotlight, Sep 2024, OpenReview
- Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks 2023, arxiv
- LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, arxiv
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 24, arxiv
- Competition-Level Problems are Effective LLM Evaluators, Dec 23, arxiv
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Oct 2023, arxiv
- AlGhafa Evaluation Benchmark for Arabic Language Models, Dec 2023, ACL Anthology, pdf
- CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, Dec 2024, Tenth Italian Conference on Computational Linguistics,
- Evaluating and Advancing Multimodal Large Language Models in Ability Lens, Nov 2024, arxiv
- Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem HF blog
- Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese , Apr 2024 arxiv
- BanglaQuAD: A Bengali Open-domain Question Answering Dataset, Oct 2024, arxiv
- AlignBench: Benchmarking Chinese Alignment of Large Language Models, Nov 2023, arxiv
- The Invalsi Benchmark: measuring Language Models Mathematical and Language understanding in Italian, Mar 2024, arxiv
- MEGA: Multilingual Evaluation of Generative AI, Mar 2023, arxiv
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, 2023, NIPS website
- LAraBench: Benchmarking Arabic AI with Large Language Models, May 23, arxiv
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?, Apr 2024, arxiv
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- Chinese MTEB C-MTEB repository
- French MTEB repository
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, May 2023, arxiv
- LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models, Nov 2024, IEEE
- ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?, Dec 2024, arxiv
- RealWorldQA, Apr 2024, HuggingFace
- Image2Struct: Benchmarking Structure Extraction for Vision-Language Models, Oct 2024, arxiv
- MMBench: Is Your Multi-modal Model an All-Around Player?, Oct 2024, Springer, ECCV 2024
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models, Oct 2024, arxiv
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI, Apr 2024, arxiv
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, CVPR 2024, CVPR
- ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models, Dec 2024, OpenReview, github for the benchmark and evaluation framework
- Careless Whisper: Speech-to-Text Hallucination Harms, FAccT '24, ACM
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?, Oct 2024, arxiv
- HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning, Oct 2024, Computer Vision – ECCV 2024
- VHELM: A Holistic Evaluation of Vision Language Models, Oct 2024, arxiv
- Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models, Reka AI, May 2024, arxiv, dataset, blog post
- Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis, Aug 2024, arxiv
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models, Jun 2024, arxiv
- EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models, Jun 2024, arxiv
- MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models, Jun 2024, arxiv
- Holistic Evaluation of Text-to-Image Models, Nov 2023, arxiv
- VBench: Comprehensive Benchmark Suite for Video Generative Models, Nov 2023, arxiv
- Evaluating Text-to-Visual Generation with Image-to-Text Generation, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning, Nov 2023, arxiv
- BLINK: Multimodal Large Language Models Can See but Not Perceive, Apr 2024, arxiv, github
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models, Apr 2024, arxiv
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, Oct 2023, arxiv
- Evaluation part of https://arxiv.org/abs/2404.18930, Apr 2024, arxiv, repository
- VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use, Aug 2023, arxiv
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities, Aug 2023, arxiv
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension, Jul 2023, arxiv
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, NeurIPS 2023, NeurIPS
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, Apr 2023, arxiv
- Evaluating Large Language Models at Evaluating Instruction Following, Oct 2023, arxiv
- Find the INTENTION OF INSTRUCTION: Comprehensive Evaluation of Instruction Understanding for Large Language Models, Dec 2024, arxiv
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models, Dec 2024, arxiv
- CFBench: A Comprehensive Constraints-Following Benchmark for LLMs, Aug 2024, arxiv
- Instruction-Following Evaluation for Large Language Models, IFEval, Nov 2023, arxiv
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, Jul 2023, arxiv, FLASK dataset
- DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation, Mar 2024, AAAI, pdf
- LongForm: Effective Instruction Tuning with Reverse Instructions, Apr 2023, arxiv, dataset
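
IFEval and the other constraint-following benchmarks above favor instructions whose satisfaction can be verified with code (length limits, required keywords, formatting) instead of a judge model. A minimal sketch of such programmatic checkers; the specific constraints are illustrative, not any benchmark's official set:

```python
import re

def check_max_words(response: str, limit: int) -> bool:
    """Constraint: answer in at most `limit` words."""
    return len(response.split()) <= limit

def check_bullet_count(response: str, n: int) -> bool:
    """Constraint: answer must contain exactly n markdown bullet points."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n

def check_contains_keyword(response: str, keyword: str) -> bool:
    """Constraint: answer must mention the keyword (case-insensitive)."""
    return keyword.lower() in response.lower()

def instruction_following_rate(responses, checkers) -> float:
    """responses: list of strings; checkers: list of callables paired 1:1 with responses."""
    results = [checker(resp) for resp, checker in zip(responses, checkers)]
    return sum(results) / max(len(results), 1)
```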
- Evaluating the Moral Beliefs Encoded in LLMs, Jul 2023, arxiv
- AI Deception: A Survey of Examples, Risks, and Potential Solutions, Aug 2023, arxiv
- Aligning AI With Shared Human Values, Aug 2020 - Feb 2023, arxiv (the ETHICS benchmark)
- What are human values, and how do we align AI to them?, Mar 2024, pdf
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Helpfulness, Honesty, Harmlessness (HHH) framework from Anthropic, introduced in A General Language Assistant as a Laboratory for Alignment, 2021, arxiv; now included in BigBench, bigbench
- WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models, April 2024, arxiv
- Chapter 19 in The Ethics of Advanced AI Assistants, Apr 2024, Google DeepMind, pdf at google
- BEHONEST: Benchmarking Honesty of Large Language Models, June 2024, arxiv
- FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations, Apr 2024, arxiv
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, 2021, arxiv, dataset
- “I’m fully who I am”: Towards centering transgender and non-binary voices to measure biases in open language generation, ACM FAccT 2023, amazon science
- This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models, May 2023, arxiv
- Benchmark for general-purpose AI chat model, December 2024, AILuminate from ML Commons, mlcommons website
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models, ECCV 2024, Jan 2024, github, arxiv (Nov 2023)
- Introducing v0.5 of the AI Safety Benchmark from MLCommons, Apr 2024, arxiv
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, Oct 2024, arxiv
- Beyond Prompt Brittleness: Evaluating the Reliability and Consistency of Political Worldviews in LLMs , Nov 2024, MIT Press
- Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems, Jan 2024, arxiv
- LLMSecCode: Evaluating Large Language Models for Secure Coding, Aug 2024, arxiv
- Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI, Sep 2024, arxiv
- DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection, Sep 2024, arxiv
- Purple Llama, an umbrella project from Meta, Purple Llama repository
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Jun 2023, arxiv
- Rethinking Backdoor Detection Evaluation for Language Models, Aug 2024, arxiv pdf
- Gradient-Based Language Model Red Teaming, Jan 24, arxiv
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, Mar 2024, arxiv
- Announcing a Benchmark to Improve AI Safety MLCommons has made benchmarks for AI performance—now it's time to measure safety, Apr 2024 IEEE Spectrum
- Model evaluation for extreme risks, May 2023, arxiv
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Jan 2024, arxiv
- CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models, July 2024, Meta, arxiv
- CYBERSECEVAL 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models, Apr 2024, Meta arxiv
- Benchmarking OpenAI o1 in Cyber Security, Oct 2024, arxiv
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models, Aug 2024, arxiv
- Evaluating Large Language Models Trained on Code, HumanEval, Jul 2021, arxiv
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation Feb 21 arxiv
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, Oct 2024, arxiv
- LLMSecCode: Evaluating Large Language Models for Secure Coding, Aug 2024, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
- SWE Bench, SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, Feb 2024, arxiv, Tech Report
- Gorilla Function Calling Leaderboard, Berkeley, Leaderboard
- DevBench: A Comprehensive Benchmark for Software Development, Mar 2024, arxiv
- MBPP (Mostly Basic Python Programming) benchmark, introduced in Program Synthesis with Large Language Models, 2021, papers with code, data
- CodeMind: A Framework to Challenge Large Language Models for Code Reasoning, Feb 2024, arxiv
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, Jan 2024, arxiv
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, Jul 2022, arxiv code at salesforce github
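
HumanEval, MBPP, and most code-generation benchmarks above report pass@k, usually with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c samples that pass the unit tests, and compute pass@k = 1 - C(n-c, k) / C(n, k). A direct sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n generated samples, c of which pass all unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 13 of them passed their tests.
print(round(pass_at_k(n=200, c=13, k=1), 4))   # ~0.065
print(round(pass_at_k(n=200, c=13, k=10), 4))  # considerably higher
```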
- Human-like Summarization Evaluation with ChatGPT, Apr 2023, arxiv
- A dataset and benchmark for hospital course summarization with adapted large language models, Dec 2024, Journal of the American Medical Informatics Association
- Evaluating the Factual Consistency of Large Language Models Through News Summarization, Nov 2022, arxiv
- USB: A Unified Summarization Benchmark Across Tasks and Domains, May 2023, arxiv
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization, 2021, Transactions ACL dataset
- Ray/Anyscale's LLM Performance Leaderboard (explanation:)
- MLCommons MLPerf benchmarks (inference) MLPerf announcement of the LLM track
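
Performance leaderboards such as those above typically report time-to-first-token, total generation time, and output tokens per second. A minimal sketch of measuring these from any streaming generation interface (the `stream_tokens` generator is a hypothetical stand-in for your serving client):

```python
import time
from typing import Callable, Iterator

def measure_streaming(stream_tokens: Callable[[str], Iterator[str]], prompt: str) -> dict:
    """Time a single streamed generation: first-token latency, total time, tokens/sec."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "total_time_s": end - start,
        "output_tokens": n_tokens,
        "tokens_per_second": n_tokens / max(end - start, 1e-9),
    }

if __name__ == "__main__":
    # Toy stand-in that "streams" ten tokens with a small delay before each one.
    fake_stream = lambda prompt: (time.sleep(0.01) or tok for tok in ["tok"] * 10)
    print(measure_streaming(fake_stream, "hello"))
```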
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making, Oct 2024, arxiv
- Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena, Oct 2023, arxiv
- LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games, Sep 2023,arxiv
- AgentBench: Evaluating LLMs as Agents, Aug 2023, arxiv
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Mar 2024, arxiv
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents, Jan 2024, arxiv
- Suri: Multi-constraint Instruction Following for Long-form Text Generation, Jun 2024, arxiv
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Aug 2024, arxiv
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, Aug 2023, arxiv
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models, Sep 2024, arxiv
- GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking, May 2023, arxiv
- LLM4DyG: Can Large Language Models Solve Spatial-Temporal Problems on Dynamic Graphs? Oct 2023, arxiv
- Talk like a Graph: Encoding Graphs for Large Language Models, Oct 2023, arxiv
- Open Graph Benchmark: Datasets for Machine Learning on Graphs, NeurIPS 2020
- Can Language Models Solve Graph Problems in Natural Language? NeurIPS 2023
- Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis, Aug 2023, [arxiv](https://arxiv.org/abs/2308.11224)
(TODO: as there are more than three papers per class, make each class a separate chapter in this Compendium)
- OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions, Dec 2024, arxiv
- Holmes ⌕ A Benchmark to Assess the Linguistic Competence of Language Models , Dec 2024, MIT Press Transactions of ACL, 2024
- EscapeBench: Pushing Language Models to Think Outside the Box, Dec 2024, arxiv
- OLMES: A Standard for Language Model Evaluations, Jun 2024, arxiv
- Tulu 3: Pushing Frontiers in Open Language Model Post-Training, Nov 2024, arxiv, see 7.1 Open Language Model Evaluation System (OLMES) and the AllenAI GitHub repository for OLMES
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making, Oct 2024, arxiv
- Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks, Nov 2024, arxiv
- Evaluating Superhuman Models with Consistency Checks, Apr 2024, IEEE
- To the Globe (TTG): Towards Language-Driven Guaranteed Travel Planning, Meta AI, Oct 2024, arxiv evaluation for tasks of travel planning
- Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?, Oct 2024, arxiv, SelfScore, a novel benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks
- Should We Really Edit Language Models? On the Evaluation of Edited Language Models, Oct 2024, arxiv
- DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs, EMNLP 2024, Oct 2024, arxiv, Repository for DyKnow
- Jeopardy dataset at HuggingFace, huggingface
- A framework for few-shot language model evaluation, Zenodo, Jul 2024, Zenodo
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, Aug 2023, arxiv
- Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable?, Dec 2022, Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM) Github repository for Human Evaluation Protocol
- From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition, Oct 2024, arxiv
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph, June 2024, arxiv
- RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style, Oct 2024, arxiv
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study Mar 24, WSDM 24, ms blog
- LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models, jul 2023 arxiv
- OpenEQA: From word models to world models, Meta, Apr 2024, on models' understanding of physical spaces, Meta AI blog
- Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive Knowledge. Apr 2024, arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents, Feb 2024, arxiv
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM,
- A User-Centric Benchmark for Evaluating Large Language Models, Apr 2024, arxiv, data of user centric benchmark at github
- RACE: Large-scale ReAding Comprehension Dataset From Examinations, 2017, arxiv RACE dataset at CMU
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models, 2020, arxiv CrowS-Pairs dataset
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, Jun 2019, ACL data
- RewardBench: Evaluating Reward Models for Language Modeling, Mar 2024, arxiv
- Toward informal language processing: Knowledge of slang in large language models, EMNLP 2023, amazon science
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability, Feb 2024, arxiv
- Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs, May 2023, Bird, a big benchmark for large-scale database grounded in text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains, arxiv, data and leaderboard
- MuSiQue: Multihop Questions via Single-hop Question Composition, Aug 2021, arxiv
- Evaluating Copyright Takedown Methods for Language Models, June 2024, arxiv
- Google Frames Dataset for evaluation of RAG systems, Sep 2024, [arxiv paper: Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation](https://arxiv.org/abs/2409.12941), Hugging Face dataset
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv, Answer Engine (RAG) Evaluation Repository
- RAGAS: Automated Evaluation of Retrieval Augmented Generation Jul 23, arxiv
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems Nov 23, arxiv
- Evaluating Retrieval Quality in Retrieval-Augmented Generation, Apr 2024, arxiv
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios, Sep 2024, arxiv
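
RAGAS, ARES, and the other RAG evaluations above usually separate retrieval quality (did the system fetch the right passages?) from generation quality (is the answer grounded in them?). A minimal sketch of the retrieval half, scoring retrieved passage ids against gold annotations; the id-based annotation format is an assumption for illustration:

```python
def retrieval_scores(retrieved_ids: list[str], gold_ids: set[str]) -> dict:
    """Precision/recall of one query's retrieved passages against gold annotations."""
    retrieved = set(retrieved_ids)
    tp = len(retrieved & gold_ids)
    return {"precision": tp / max(len(retrieved), 1),
            "recall": tp / max(len(gold_ids), 1)}

def macro_average(per_query: list[dict]) -> dict:
    """Average precision/recall across queries."""
    keys = ["precision", "recall"]
    return {k: sum(q[k] for q in per_query) / max(len(per_query), 1) for k in keys}
```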
And Dialog systems
- Benchmark for general-purpose AI chat model, December 2024, AILuminate from ML Commons, mlcommons website
- Introducing v0.5 of the AI Safety Benchmark from MLCommons, Apr 2024, arxiv
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems, Jun 2024, arxiv
- Simulated user feedback for the LLM production, TDS
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis, Feb 2024, arxiv
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs, Apr 2024, arxiv
- A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4, Jun 2024, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- Investigating Users' Search Behavior and Outcome with ChatGPT in Learning-oriented Search Tasks, SIGIR-AP 2024, ACM
- Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation, RecSys 2023
- Is ChatGPT a Good Recommender? A Preliminary Study, Apr 2023, arxiv
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios, Sep 2024, arxiv
- LLMRec: Benchmarking Large Language Models on Recommendation Task, Aug 2023, arxiv
- OpenP5: Benchmarking Foundation Models for Recommendation, Jun 2023, researchgate
- Marqo embedding benchmark for eCommerce at Huggingface, text to image and category to image tasks
- LaMP: When Large Language Models Meet Personalization, Apr 2023, arxiv
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv, Answer Engine (RAG) Evaluation Repository
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives, Feb 2024, arxiv
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, Apr 2023, arxiv
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Oct 2021, arxiv
- Benchmark: LoTTE, Long-Tail Topic-stratified Evaluation for IR that features 12 domain-specific search tests, spanning StackExchange communities and using queries from GooAQ, ColBERT repository with the benchmark data
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM
- STARK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases, Apr 2024, arxiv, code github
- Constitutional AI: Harmlessness from AI Feedback, Sep 2022, arxiv (see Appendix B Identifying and Classifying Harmful Conversations, other parts)
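
BEIR, LoTTE, and the re-ranking studies above standardize on graded ranking metrics, most commonly nDCG@k. A minimal sketch of the usual formulation (log2 discount, ideal ranking obtained by sorting the gains):

```python
import math

def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevance scores."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_relevance: list[float], k: int = 10) -> float:
    """ranked_relevance: graded relevance of documents in the order the system returned them."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevance, k) / ideal_dcg

# Example: system ranked docs with relevance [3, 2, 3, 0, 1]; nDCG@5 is close to but below 1.
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 4))
```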
- Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications, Feb 2024, arxiv
- MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models, Dec 2024, openreview arxiv benchmark code and data at github
- A dataset and benchmark for hospital course summarization with adapted large language models, Dec 2024, Journal of the American Medical Informatics Association
- A framework for human evaluation of large language models in healthcare derived from literature review, September 2024, Nature Digital Medicine
- Evaluation and mitigation of cognitive biases in medical language models, Oct 2024 Nature
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI Feb 24, Nature
- Evaluating Generative AI Responses to Real-world Drug-Related Questions, June 2024, Psychiatry Research
- Clinical Insights: A Comprehensive Review of Language Models in Medicine, Aug 2024, arxiv See table 2 for evaluation
- Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data Jan 2024 arxiv
- Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis, Jan 2024, arxiv
- MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022, PMLR
- What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams, MedQA benchmark, Sep 2020, arxiv
- PubMedQA: A Dataset for Biomedical Research Question Answering, 2019, acl
- Open Medical LLM Leaderboard from HF Explanation
- Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics, Apr 2023, arxiv
- Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery, Apr 2023, pub med
- Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today, Jun 2023, arxiv
- Evaluating the use of large language model in identifying top research questions in gastroenterology, Mar 2023, nature
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- MedDialog: Two Large-scale Medical Dialogue Datasets, Apr 2020, arxiv
- An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, article html
- DrugBank 5.0: a major update to the DrugBank database for 2018, 2018, paper html
- A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models, May 2024, nature, dataset
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, Aug 2023, arxiv
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, NeurIPS 2023
- LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain, EMNLP 2023
- Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities NeurIPS 2022
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, 2022, arxiv
- What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks, NeurIPS 2023, NeurIPS 2023
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Nov 2023, arxiv
- MATH, Mathematics Aptitude Test of Heuristics, Measuring Mathematical Problem Solving With the MATH Dataset, Nov 2021, arxiv
- How well do large language models perform in arithmetic tasks?, Mar 2023, arxiv
- FrontierMath at EpochAI, FrontierAI page, FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, Nov 2024, arxiv
- Cmath: Can your language model pass chinese elementary school math test?, Jun 23, arxiv
- GSM8K, Papers with Code, github repository
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 24, arxiv
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance, Jun 2023, arxiv
- BloombergGPT: A Large Language Model for Finance (see Chapter 5 Evaluation), Mar 2023, arxiv
- FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Oct 2023, arxiv
- Understanding the Capabilities of Large Language Models for Automated Planning, May 2023, arxiv
- LLM/VLM Benchmarks by Aman Chadha
- Awesome LLMs Evaluation Papers, a list of papers mentioned in the Evaluating Large Language Models: A Comprehensive Survey, Nov 2023
@article{Lopatenko2024CompendiumLLMEvaluation,
title = {Compendium of LLM Evaluation methods},
author = {Lopatenko, Andrei},
year = {2024},
note = {\url{https://github.com/alopatenko/LLMEvaluation}}
}