The aim of this compendium is to help academics and industry professionals create effective evaluation suites tailored to their specific needs. It reviews the top industry practices for assessing large language models (LLMs) and their applications. The work goes beyond cataloging benchmarks and evaluation studies: it offers a comprehensive overview of effective and practical evaluation techniques, including those embedded in papers whose primary focus is introducing new LLM methodologies and tasks. I plan to update this survey periodically with noteworthy and shareable evaluation methods that I come across. The goal is a resource where anyone with a question (how to evaluate an LLM or an LLM application for a specific task, which methods best assess LLM effectiveness, or how well an LLM performs in a particular domain) can easily find the relevant information. Additionally, I want to highlight methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align well with business or academic objectives.
My view on LLM Evaluation: Deck; SF Big Analytics and AICamp video; Analytics Vidhya (Data Phoenix, Mar 5) (by Andrei Lopatenko)
- Reviews and Surveys
- Leaderboards and Arenas
- Evaluation Software
- LLM Evaluation articles in tech media and blog posts from companies
- Large benchmarks
- Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation
- Long Comprehensive Studies
- HITL (Human in the Loop)
- LLM as Judge
- LLM Evaluation
- Embeddings
- In Context Learning
- Hallucinations
- Question Answering
- Multi Turn
- Reasoning
- Multi-Lingual
- Multi-Modal
- Instruction Following
- Ethical AI
- Biases
- Safe AI
- Cybersecurity
- Code Generating LLMs
- Summarization
- LLM quality (generic methods: overfitting, redundant layers, etc.)
- Software Performance (latency, throughput, memory, storage)
- Agent LLM architectures
- Long Text Generation
- Graph Understanding
- Various unclassified tasks
- LLM Systems
- Other collections
- Citation
- Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv:
- A Survey on Evaluation of Large Language Models, Jul 2023, arxiv:
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv:
- New Hard Leaderboard by HuggingFace leaderboard description, blog post
- LMSys Arena (explanation:)
- Salesforce's Contextual Bench leaderboard at Hugging Face, an overview of how different LLMs perform across a variety of contextual tasks
- OpenGPT-X Multi-Lingual European LLM Leaderboard, evaluation of LLMs for many European languages, on HuggingFace
- OpenLLM Leaderboard
- MTEB
- SWE Bench
- AlpacaEval leaderboard Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv code
- Open Medical LLM Leaderboard from HF Explanation
- Gorilla, Berkeley function calling Leaderboard Explanation
- WildBench WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- Enterprise Scenarios, Patronus
- Vectara Hallucination Leaderboard
- Ray/Anyscale's LLM Performance Leaderboard (explanation:)
- Hugging Face LLM Performance leaderboard
- Multi-task Language Understanding on MMLU
- EleutherAI LLM Evaluation Harness
- Eureka, Microsoft, A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. github Sep 2024 arxiv
- OpenAI Evals
- ConfidentAI DeepEval
- MTEB
- OpenICL Framework
- RAGAS
- ML Flow Evaluate
- MosaicML Composer
- Toolkit from Mozilla AI for LLM-as-judge evaluation: lm-buddy eval tool, model: Prometheus
- TruLens
- Promptfoo
- BigCode Evaluation Harness
- LangFuse
- LLMeBench see LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
- ChainForge
- Ironclad Rivet
- LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models, arxiv pdf github repository
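
The tools above differ in task libraries, integrations, and reporting, but most reduce to the same core loop: run prompts through a model, score the outputs against references or a judge, and aggregate the scores. A minimal sketch of that loop in plain Python (the `generate_fn` callable and the toy task are hypothetical placeholders, not the API of any framework listed here):

```python
from typing import Callable, Iterable

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return float(norm(prediction) == norm(reference))

def run_eval(generate_fn: Callable[[str], str],
             task: Iterable[tuple[str, str]]) -> dict:
    """Run every (prompt, reference) pair through the model and aggregate a score."""
    scores = [exact_match(generate_fn(prompt), reference) for prompt, reference in task]
    return {"n": len(scores), "exact_match": sum(scores) / max(len(scores), 1)}

if __name__ == "__main__":
    # Hypothetical toy task; real harnesses load tasks such as MMLU or HellaSwag.
    toy_task = [("What is the capital of France?", "Paris"),
                ("2 + 2 =", "4")]
    # Stand-in model; replace with a call into your LLM or an API client.
    toy_model = lambda prompt: "Paris" if "France" in prompt else "4"
    print(run_eval(toy_model, toy_task))
```

Real harnesses add task registries, few-shot prompt construction, batching, caching, and richer metrics on top of this skeleton.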
---
- Evaluate LLMs using Evaluation Harness and Hugging Face TGI/vLLM, Sep 2024, blog
- Let's talk about LLM Evaluation, HuggingFace, article
- Using LLMs for Evaluation: LLM-as-a-Judge and other scalable additions to human quality ratings, Aug 2024, Deep Learning Focus
- Introducing SimpleQA, OpenAI, Oct 2024 OpenAI
- Catch me if you can! How to beat GPT-4 with a 13B model, LM sys org
- Why it’s impossible to review AIs, and why TechCrunch is doing it anyway, TechCrunch, Mar 2024
- A.I. has a measurement problem, NY Times, Apr 2024
- Beyond Accuracy: The Changing Landscape Of AI Evaluation, Forbes, Mar 2024
- Mozilla AI Exploring LLM Evaluation at scale
- Evaluation part of How to Maximize LLM Performance
- The Mozilla AI blog has published multiple good articles on LLM evaluation
- Andrej Karpathy on evaluation X
- From Meta on evaluation of Llama 3 models github
- DeepMind AI Safety evaluation, June 2024, DeepMind blog: Introducing the Frontier Safety Framework
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- Hamel Dev, March 2024, Your AI Product Needs Evals: How to construct domain-specific LLM evaluation systems
- MMLU Pro: Massive Multitask Language Understanding - Pro version, Jun 2024, arxiv
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, EMNLP 2022, pdf
- Measuring Massive Multitask Language Understanding, MMLU, ICLR 2021, arxiv, MMLU dataset
- BigBench: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, arxiv, datasets
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Oct 2022, arxiv
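
Many of the benchmarks above (MMLU, parts of BIG-bench) are multiple-choice and are commonly scored by comparing the model's log-likelihood of each answer option rather than parsing generated text. A minimal sketch of that scoring rule, assuming a hypothetical `option_logprob(question, option)` callable provided by your model wrapper:

```python
from typing import Callable, Sequence

def pick_option(question: str,
                options: Sequence[str],
                option_logprob: Callable[[str, str], float]) -> int:
    """Return the index of the option to which the model assigns the highest log-likelihood."""
    scores = [option_logprob(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

def accuracy(dataset, option_logprob) -> float:
    """dataset: iterable of (question, options, correct_index) triples."""
    hits = [pick_option(q, opts, option_logprob) == gold for q, opts, gold in dataset]
    return sum(hits) / max(len(hits), 1)

if __name__ == "__main__":
    # Toy illustration with a fake scorer that simply prefers shorter options.
    fake_logprob = lambda question, option: -float(len(option))
    demo = [("Largest planet?", ["Jupiter", "Earth is the largest"], 0)]
    print("accuracy:", accuracy(demo, fake_logprob))
```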
- Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks, May 2024, ICML 2024, arxiv
- A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations, EMNLP 2024, ACLAnthology
- Lessons from the Trenches on Reproducible Evaluation of Language Models, May 2024, arxiv
- Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat, Nov 2024, arxiv
- Sabotage Evaluations for Frontier Models, Anthropic, Nov 2024, paper blog post
- AI Benchmarks and Datasets for LLM Evaluation, Dec 2024, arxiv, a survey of many LLM benchmarks
- Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks, Aug 2024, ACL 2024
- Synthetic data in evaluation, see Chapter 3 in Best Practices and Lessons Learned on Synthetic Data for Language Models, Apr 2024, arxiv
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 arxiv
- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, arxiv
- Are Emergent Abilities of Large Language Models a Mirage?, Apr 2023, arxiv
- Don't Make Your LLM an Evaluation Benchmark Cheater, Nov 2023, arxiv
- Evaluating Question Answering Evaluation, 2019, ACL
- Evaluating Open-QA Evaluation, 2023, arxiv
- (Re: statistical methods) Prediction-Powered Inference, Jan 2023, arxiv; PPI++: Efficient Prediction-Powered Inference, Nov 2023, arxiv
- Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress, Feb 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Detecting Pretraining Data from Large Language Models, Oct 2023, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- Faithful model evaluation for model-based metrics, EMNLP 2023, amazon science
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models, ICML 2023, mlr press
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- State of What Art? A Call for Multi-Prompt LLM Evaluation, Aug 2024, Transactions of the Association for Computational Linguistics (2024) 12
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- Evaluation of OpenAI o1: Opportunities and Challenges of AGI, Sep 2024, arxiv
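
Several of the papers above (for example "Elo Uncovered" and "Ranking Unraveled") examine how arena-style leaderboards turn pairwise votes into ratings and how sensitive those ratings are to the battle sample. A minimal sketch of the standard online Elo update plus a bootstrap over resampled battles, one simple way to expose that sensitivity (the toy battle data and K-factor are illustrative assumptions):

```python
import random
from collections import defaultdict

def elo_ratings(battles, k=4.0, base=1000.0):
    """battles: list of (model_a, model_b, winner) with winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in battles:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if winner == "a" else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

def bootstrap_elo(battles, rounds=200, seed=0):
    """Resample battles with replacement to get a (min, mean, max) rating range per model."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = [battles[rng.randrange(len(battles))] for _ in battles]
        for model, rating in elo_ratings(resampled).items():
            samples[model].append(rating)
    return {m: (min(r), sum(r) / len(r), max(r)) for m, r in samples.items()}

if __name__ == "__main__":
    toy_battles = [("m1", "m2", "a")] * 60 + [("m1", "m2", "b")] * 40
    print(bootstrap_elo(toy_battles))
```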
- Evaluating Question Answering Evaluation, 2019, ACL
- Developing a Framework for Auditing Large Language Models Using Human-in-the-Loop, Feb 2024, arxiv
- Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation, Nov 2023, arxiv
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, Dec 2024, arxiv
- Large Language Models are Inconsistent and Biased Evaluators, May 2024, arxiv
- Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation, Oct 2024, arxiv
- Evaluating LLMs at Detecting Errors in LLM Responses, Apr 2024, arxiv
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models, Apr 2024, arxiv
- Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries, Sep 2024, arxiv
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Jun 2023, arxiv
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, Jun 2024, arxiv
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv leaderboard code
- Discovering Language Model Behaviors with Model-Written Evaluations, Dec 2022, arxiv
- The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate, Feb 2024, arxiv
- Benchmarking Foundation Models with Language-Model-as-an-Examiner, 2022, NEURIPS
- Red Teaming Language Models with Language Models, Feb 2022, arxiv
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, Aug 2023, arxiv
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning, Sep 2023, arxiv
- Style Over Substance: Evaluation Biases for Large Language Models, Jul 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality, Feb 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Code Generation, Apr 2023, researchgate
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators, Mar 2024, arxiv
- LLM Evaluators Recognize and Favor Their Own Generations, Apr 2024, pdf
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences, Apr 2024, arxiv
- Using LLMs for Evaluation LLM-as-a-Judge and other scalable additions to human quality ratings. Aug 2024, Deep Learning Focus
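
A recurring finding in the judge papers above is position bias: an LLM judge often favors whichever answer it sees first. A common mitigation is to query the judge twice with the answer order swapped and keep only consistent verdicts, as sketched below (the `judge_fn` callable returning "A" or "B" is a hypothetical stand-in for your judge model, and the prompt template is illustrative):

```python
from typing import Callable, Optional

JUDGE_TEMPLATE = (
    "You are a strict evaluator. Question:\n{question}\n\n"
    "Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
    "Reply with exactly one character, A or B, for the better answer."
)

def pairwise_verdict(question: str, answer_1: str, answer_2: str,
                     judge_fn: Callable[[str], str]) -> Optional[int]:
    """Return 1 if answer_1 wins, 2 if answer_2 wins, None if the judge is inconsistent."""
    first = judge_fn(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2)).strip().upper()
    second = judge_fn(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1)).strip().upper()
    if first == "A" and second == "B":
        return 1
    if first == "B" and second == "A":
        return 2
    return None  # position-dependent verdict: discard or escalate to a human rater
```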
- MTEB: Massive Text Embedding Benchmark, Oct 2022, [arxiv](https://arxiv.org/abs/2210.07316), Leaderboard
- Marqo embedding benchmark for eCommerce at Huggingface, text to image and category to image tasks
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- MMTEB: Community driven extension to MTEB repository
- Chinese MTEB C-MTEB repository
- French MTEB repository
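
MTEB-style retrieval tasks reduce to embedding queries and documents, ranking by cosine similarity, and reporting metrics such as recall@k or nDCG@k. A minimal recall@k sketch with numpy, assuming you have already produced the embedding matrices with the model under test:

```python
import numpy as np

def recall_at_k(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                relevant: list[set[int]], k: int = 10) -> float:
    """query_vecs: (Q, d); doc_vecs: (N, d); relevant[i]: indices of docs relevant to query i."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                            # cosine similarity matrix, shape (Q, N)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the top-k documents per query
    hits = [len(set(top_k[i]) & relevant[i]) / max(len(relevant[i]), 1)
            for i in range(len(relevant))]
    return float(np.mean(hits))
```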
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- The LAMBADA dataset: Word prediction requiring a broad discourse context 2016, arxiv
- TrustLLM: Trustworthiness in Large Language Models: A Principle and Benchmark, Jan 2024, arxiv
- INVITE: A testbed of automatically generated invalid questions to evaluate large language models for hallucinations, EMNLP 2023, amazon science
- A Survey of Hallucination in Large Visual Language Models, Oct 2024, See Chapter IV, Evaluation of Hallucinations arxiv
- Generating Benchmarks for Factuality Evaluation of Language Models, Jul 2023, arxiv
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, Dec 2023, ACL
- Long-form factuality in large language models, Mar 2024, arxiv
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Sep 2023, arxiv
- Measuring Faithfulness in Chain-of-Thought Reasoning, Jul 2023, arxiv
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, May 2023, arxiv repository
- Introducing SimpleQA, OpenAI, Oct 2024 OpenAI
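
Several entries above (FActScore, the long-form factuality work) score hallucination by splitting a generation into atomic claims and checking each claim against a knowledge source. A heavily simplified sketch of that pipeline; the `extract_claims` and `is_supported` callables are hypothetical components you would back with an LLM, a retriever, or human annotation:

```python
from typing import Callable

def factual_precision(generation: str,
                      extract_claims: Callable[[str], list[str]],
                      is_supported: Callable[[str], bool]) -> float:
    """Fraction of extracted atomic claims that the verifier marks as supported."""
    claims = extract_claims(generation)
    if not claims:
        return 1.0  # nothing checkable; treated as trivially factual (a design choice)
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)

# Naive placeholder claim splitter: one "claim" per sentence.
naive_split = lambda text: [s.strip() for s in text.split(".") if s.strip()]
```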
QA is used in many vertical domains; see the Vertical section below.
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv Answer Engine (RAG) Evaluation Repository
- Introducing SimpleQA, OpenAI, Oct 2024 OpenAI
- Are Large Language Models Consistent over Value-laden Questions?, Jul 2024, arxiv
- CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, Jun 2019, ACL
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, Sep 2018, arxiv OpenBookQA dataset at AllenAI
- Jin, Di, et al., "What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams", 2020, arxiv, MedQA
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018, arxiv ARC Easy dataset ARC dataset
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019, arxiv BoolQ dataset
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- PIQA: Reasoning about Physical Commonsense in Natural Language, Nov 2019, arxiv PIQA dataset
- Crowdsourcing Multiple Choice Science Questions arxiv SciQ dataset
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019, arxiv, Winogrande dataset
- TruthfulQA: Measuring How Models Mimic Human Falsehoods, Sep 2021, arxiv
- TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, arxiv data
- Natural Questions: A Benchmark for Question Answering Research, Transactions ACL 2019
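
Short-answer QA benchmarks like several of those above are conventionally scored with normalized exact match and token-level F1 (the SQuAD convention). A minimal sketch of both metrics:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```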
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models Nov 2023, arxiv
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues Feb 24 arxiv
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis Feb 2024 arxiv
- Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models, Oct 2023, arxiv
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS 2023, NeurIPS
- MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback, Sep 2023, arxiv
- FrontierMath at EpochAI, FrontierAI page, FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, Nov 2024, arxiv
- Easy Problems That LLMs Get Wrong, May 2024, arxiv, a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning, NeurIPS 2024 Track Datasets and Benchmarks Spotlight, Sep 2024, OpenReview
- Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks 2023, arxiv
- LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, arxiv
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 24, arxiv
- Competition-Level Problems are Effective LLM Evaluators, Dec 23, arxiv
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Oct 2023, arxiv
- AlGhafa Evaluation Benchmark for Arabic Language Models, Dec 2023, ACL Anthology, pdf
- CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, Dec 2024, Tenth Italian Conference on Computational Linguistics,
- Evaluating and Advancing Multimodal Large Language Models in Ability Lens, Nov 2024, arxiv
- Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem HF blog
- Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese , Apr 2024 arxiv
- BanglaQuAD: A Bengali Open-domain Question Answering Dataset, Oct 2024, arxiv
- AlignBench: Benchmarking Chinese Alignment of Large Language Models, Nov 2023, arxiv
- The Invalsi Benchmark: measuring Language Models Mathematical and Language understanding in Italian, Mar 2024, arxiv
- MEGA: Multilingual Evaluation of Generative AI, Mar 2023, arxiv
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, 2023, NIPS website
- LAraBench: Benchmarking Arabic AI with Large Language Models, May 23, arxiv
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?, Apr 2024, arxiv
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- Chinese MTEB C-MTEB repository
- French MTEB repository
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, May 2023, arxiv
- LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models, Nov 2024, IEEE
- ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?, Dec 2024, arxiv
- RealWorldQA, Apr 2024, HuggingFace
- Image2Struct: Benchmarking Structure Extraction for Vision-Language Models, Oct 2024, arxiv
- MMBench: Is Your Multi-modal Model an All-Around Player?, Oct 2024, Springer, ECCV 2024
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models, Oct 2024, arxiv
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI, Apr 2024, arxiv
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, CVPR 2024, CVPR
- ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models, Dec 2024, OpenReview, github for the benchmark and evaluation framework
- Careless Whisper: Speech-to-Text Hallucination Harms, FAccT '24, ACM
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?, Oct 2024, arxiv
- HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning, Oct 2024, Computer Vision – ECCV 2024
- VHELM: A Holistic Evaluation of Vision Language Models, Oct 2024, arxiv
- Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models, Reka AI, May 2024, arxiv, dataset, blog post
- Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis, Aug 2024, arxiv
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models, Jun 2024, arxiv
- EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models, Jun 2024, arxiv
- MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models, Jun 2024, arxiv
- Holistic Evaluation of Text-to-Image Models, Nov 2023, arxiv
- VBench: Comprehensive Benchmark Suite for Video Generative Models, Nov 2023, arxiv
- Evaluating Text-to-Visual Generation with Image-to-Text Generation, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning, Nov 2023, arxiv
- BLINK: Multimodal Large Language Models Can See but Not Perceive, Apr 2024, arxiv, github
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models, Apr 2024, arxiv
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, Oct 2023, arxiv
- Evaluation part of https://arxiv.org/abs/2404.18930, Apr 2024, arxiv, repository
- VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use, Aug 2023, arxiv
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities, Aug 2023, arxiv
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension, Jul 2023, arxiv
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, NeurIPS 2023, NeurIPS
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, Apr 2023, arxiv
- Evaluating Large Language Models at Evaluating Instruction Following, Oct 2023, arxiv
- Find the INTENTION OF INSTRUCTION: Comprehensive Evaluation of Instruction Understanding for Large Language Models, Dec 2024, arxiv
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models, Dec 2024, arxiv
- CFBench: A Comprehensive Constraints-Following Benchmark for LLMs, Aug 2024, arxiv
- Instruction-Following Evaluation for Large Language Models, IFEval, Nov 2023, arxiv
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, Jul 2023, arxiv, FLASK dataset
- DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation, Mar 2024, AAAI, pdf
- LongForm: Effective Instruction Tuning with Reverse Instructions, Apr 2023, arxiv, dataset
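
IFEval and the other constraint-following benchmarks above favor instructions whose satisfaction can be verified with code (length limits, required keywords, formatting) instead of a judge model. A minimal sketch of such programmatic checkers; the specific constraints are illustrative, not any benchmark's official set:

```python
import re

def check_max_words(response: str, limit: int) -> bool:
    """Constraint: answer in at most `limit` words."""
    return len(response.split()) <= limit

def check_bullet_count(response: str, n: int) -> bool:
    """Constraint: answer must contain exactly n markdown bullet points."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n

def check_contains_keyword(response: str, keyword: str) -> bool:
    """Constraint: answer must mention the keyword (case-insensitive)."""
    return keyword.lower() in response.lower()

def instruction_following_rate(responses, checkers) -> float:
    """responses: list of strings; checkers: list of callables paired 1:1 with responses."""
    results = [checker(resp) for resp, checker in zip(responses, checkers)]
    return sum(results) / max(len(results), 1)
```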
- Evaluating the Moral Beliefs Encoded in LLMs, Jul 2023, arxiv
- AI Deception: A Survey of Examples, Risks, and Potential Solutions, Aug 2023, arxiv
- Aligning AI With Shared Human Values, Aug 2020 - Feb 2023, arxiv (the ETHICS benchmark)
- What are human values, and how do we align AI to them?, Mar 2024, pdf
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Helpfulness, Honesty, Harmlessness (HHH) framework from Anthropic, introduced in A General Language Assistant as a Laboratory for Alignment, 2021, arxiv; now included in BigBench, bigbench
- WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models, April 2024, arxiv
- Chapter 19 in The Ethics of Advanced AI Assistants, Apr 2024, Google DeepMind, pdf at google
- BEHONEST: Benchmarking Honesty of Large Language Models, June 2024, arxiv
- FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations, Apr 2024, arxiv
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, 2021, arxiv, dataset
- “I’m fully who I am”: Towards centering transgender and non-binary voices to measure biases in open language generation, ACM FAccT 2023, amazon science
- This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models, May 2023, arxiv
- Benchmark for general-purpose AI chat model, December 2024, AILuminate from ML Commons, mlcommons website
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models, ECCV 2024, Jan 2024, github, arxiv (Nov 2023)
- Introducing v0.5 of the AI Safety Benchmark from MLCommons, Apr 2024, arxiv
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, Oct 2024, arxiv
- Beyond Prompt Brittleness: Evaluating the Reliability and Consistency of Political Worldviews in LLMs , Nov 2024, MIT Press
- Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems, Jan 2024, arxiv
- LLMSecCode: Evaluating Large Language Models for Secure Coding, Aug 2024, arxiv
- Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI, Sep 2024, arxiv
- DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection, Sep 2024, arxiv
- Purple Llama, an umbrella project from Meta, Purple Llama repository
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Jun 2023, arxiv
- Rethinking Backdoor Detection Evaluation for Language Models, Aug 2024, arxiv pdf
- Gradient-Based Language Model Red Teaming, Jan 24, arxiv
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, Mar 2024, arxiv
- Announcing a Benchmark to Improve AI Safety MLCommons has made benchmarks for AI performance—now it's time to measure safety, Apr 2024 IEEE Spectrum
- Model evaluation for extreme risks, May 2023, arxiv
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Jan 2024, arxiv
- CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models, July 2024, Meta, arxiv
- CYBERSECEVAL 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models, Apr 2024, Meta arxiv
- Benchmarking OpenAI o1 in Cyber Security, Oct 2024, arxiv
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models, Aug 2024, arxiv
- Evaluating Large Language Models Trained on Code, HumanEval, Jul 2021, arxiv
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation Feb 21 arxiv
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, Oct 2024, arxiv
- LLMSecCode: Evaluating Large Language Models for Secure Coding, Aug 2024, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
- SWE Bench, SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, Feb 2024, arxiv, Tech Report
- Gorilla Function Calling Leaderboard, Berkeley, Leaderboard
- DevBench: A Comprehensive Benchmark for Software Development, Mar 2024, arxiv
- MBPP (Mostly Basic Python Programming) benchmark, introduced in Program Synthesis with Large Language Models, 2021, papers with code, data
- CodeMind: A Framework to Challenge Large Language Models for Code Reasoning, Feb 2024, arxiv
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, Jan 2024, arxiv
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, Jul 2022, arxiv code at salesforce github
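
HumanEval, MBPP, and most code-generation benchmarks above report pass@k, usually with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c samples that pass the unit tests, and compute pass@k = 1 - C(n-c, k) / C(n, k). A direct sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n generated samples, c of which pass all unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 13 of them passed their tests.
print(round(pass_at_k(n=200, c=13, k=1), 4))   # ~0.065
print(round(pass_at_k(n=200, c=13, k=10), 4))  # considerably higher
```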
- Human-like Summarization Evaluation with ChatGPT, Apr 2023, arxiv
- A dataset and benchmark for hospital course summarization with adapted large language models, Dec 2024, Journal of the American Medical Informatics Association
- Evaluating the Factual Consistency of Large Language Models Through News Summarization, Nov 2022, arxiv
- USB: A Unified Summarization Benchmark Across Tasks and Domains, May 2023, arxiv
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization, 2021, Transactions ACL dataset
- Ray/Anyscale's LLM Performance Leaderboard (explanation:)
- MLCommons MLPerf benchmarks (inference) MLPerf announcement of the LLM track
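
Performance leaderboards such as those above typically report time-to-first-token, total generation time, and output tokens per second. A minimal sketch of measuring these from any streaming generation interface (the `stream_tokens` generator is a hypothetical stand-in for your serving client):

```python
import time
from typing import Callable, Iterator

def measure_streaming(stream_tokens: Callable[[str], Iterator[str]], prompt: str) -> dict:
    """Time a single streamed generation: first-token latency, total time, tokens/sec."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "total_time_s": end - start,
        "output_tokens": n_tokens,
        "tokens_per_second": n_tokens / max(end - start, 1e-9),
    }

if __name__ == "__main__":
    # Toy stand-in that "streams" ten tokens with a small delay before each one.
    fake_stream = lambda prompt: (time.sleep(0.01) or tok for tok in ["tok"] * 10)
    print(measure_streaming(fake_stream, "hello"))
```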
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making, Oct 2024, arxiv
- Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena, Oct 2023, arxiv
- LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games, Sep 2023,arxiv
- AgentBench: Evaluating LLMs as Agents, Aug 2023, arxiv
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Mar 2024, arxiv
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents, Jan 2024, arxiv
- Suri: Multi-constraint Instruction Following for Long-form Text Generation, Jun 2024, arxiv
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Aug 2024, arxiv
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, Aug 2023, arxiv
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models, Sep 2024, arxiv
- GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking, May 2023, arxiv
- LLM4DyG: Can Large Language Models Solve Spatial-Temporal Problems on Dynamic Graphs? Oct 2023, arxiv
- Talk like a Graph: Encoding Graphs for Large Language Models, Oct 2023, arxiv
- Open Graph Benchmark: Datasets for Machine Learning on Graphs, NeurIPS 2020
- Can Language Models Solve Graph Problems in Natural Language? NeurIPS 2023
- Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis, Aug 2023, [arxiv](https://arxiv.org/abs/2308.11224)
(TODO: as there are more than three papers per class, make each class a separate chapter in this Compendium)
- OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions, Dec 2024, arxiv
- Holmes ⌕ A Benchmark to Assess the Linguistic Competence of Language Models , Dec 2024, MIT Press Transactions of ACL, 2024
- EscapeBench: Pushing Language Models to Think Outside the Box, Dec 2024, arxiv
- OLMES: A Standard for Language Model Evaluations, Jun 2024, arxiv
- Tulu 3: Pushing Frontiers in Open Language Model Post-Training, Nov 2024, arxiv, see 7.1 Open Language Model Evaluation System (OLMES) and the AllenAI GitHub repository for OLMES
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making, Oct 2024, arxiv
- Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks, Nov 2024, arxiv
- Evaluating Superhuman Models with Consistency Checks, Apr 2024, IEEE
- To the Globe (TTG): Towards Language-Driven Guaranteed Travel Planning, Meta AI, Oct 2024, arxiv evaluation for tasks of travel planning
- Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?, Oct 2024, arxiv, SelfScore, a novel benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks
- Should We Really Edit Language Models? On the Evaluation of Edited Language Models, Oct 2024, arxiv
- DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs, EMNLP 2024, Oct 2024, arxiv, Repository for DyKnow
- Jeopardy dataset at HuggingFace, huggingface
- A framework for few-shot language model evaluation, Zenodo, Jul 2024, Zenodo
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, Aug 2023, arxiv
- Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable?, Dec 2022, Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM) Github repository for Human Evaluation Protocol
- From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition, Oct 2024, arxiv
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph, June 2024, arxiv
- RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style, Oct 2024, arxiv
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study Mar 24, WSDM 24, ms blog
- LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models, jul 2023 arxiv
- OpenEQA: From word models to world models, Meta, Apr 2024, on models' understanding of physical spaces, Meta AI blog
- Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive Knowledge. Apr 2024, arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents, Feb 2024, arxiv
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM,
- A User-Centric Benchmark for Evaluating Large Language Models, Apr 2024, arxiv, data of user centric benchmark at github
- RACE: Large-scale ReAding Comprehension Dataset From Examinations, 2017, arxiv RACE dataset at CMU
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models, 2020, arxiv CrowS-Pairs dataset
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, Jun 2019, ACL data
- RewardBench: Evaluating Reward Models for Language Modeling, Mar 2024, arxiv
- Toward informal language processing: Knowledge of slang in large language models, EMNLP 2023, amazon science
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability, Feb 2024, arxiv
- Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs, May 2023, Bird, a big benchmark for large-scale database grounded in text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains, arxiv, data and leaderboard
- MuSiQue: Multihop Questions via Single-hop Question Composition, Aug 2021, arxiv
- Evaluating Copyright Takedown Methods for Language Models, June 2024, arxiv
- Google Frames Dataset for evaluation of RAG systems, Sep 2024, [arxiv paper: Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation](https://arxiv.org/abs/2409.12941), Hugging Face dataset
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv, Answer Engine (RAG) Evaluation Repository
- RAGAS: Automated Evaluation of Retrieval Augmented Generation Jul 23, arxiv
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems Nov 23, arxiv
- Evaluating Retrieval Quality in Retrieval-Augmented Generation, Apr 2024, arxiv
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios, Sep 2024, arxiv
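
RAGAS, ARES, and the other RAG evaluations above usually separate retrieval quality (did the system fetch the right passages?) from generation quality (is the answer grounded in them?). A minimal sketch of the retrieval half, scoring retrieved passage ids against gold annotations; the id-based annotation format is an assumption for illustration:

```python
def retrieval_scores(retrieved_ids: list[str], gold_ids: set[str]) -> dict:
    """Precision/recall of one query's retrieved passages against gold annotations."""
    retrieved = set(retrieved_ids)
    tp = len(retrieved & gold_ids)
    return {"precision": tp / max(len(retrieved), 1),
            "recall": tp / max(len(gold_ids), 1)}

def macro_average(per_query: list[dict]) -> dict:
    """Average precision/recall across queries."""
    keys = ["precision", "recall"]
    return {k: sum(q[k] for q in per_query) / max(len(per_query), 1) for k in keys}
```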
And Dialog systems
- Benchmark for general-purpose AI chat model, December 2024, AILuminate from ML Commons, mlcommons website
- Introducing v0.5 of the AI Safety Benchmark from MLCommons, Apr 2024, arxiv
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems, Jun 2024, arxiv
- Simulated user feedback for the LLM production, TDS
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis, Feb 2024, arxiv
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs, Apr 2024, arxiv
- A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4, Jun 2024, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- Investigating Users' Search Behavior and Outcome with ChatGPT in Learning-oriented Search Tasks, SIGIR-AP 2024, ACM
- Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation, RecSys 2023
- Is ChatGPT a Good Recommender? A Preliminary Study, Apr 2023, arxiv
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios, Sep 2024, arxiv
- LLMRec: Benchmarking Large Language Models on Recommendation Task, Aug 2023, arxiv
- OpenP5: Benchmarking Foundation Models for Recommendation, Jun 2023, researchgate
- Marqo embedding benchmark for eCommerce at Huggingface, text to image and category to image tasks
- LaMP: When Large Language Models Meet Personalization, Apr 2023, arxiv
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses, Oct 2024, Salesforce, arxiv, Answer Engine (RAG) Evaluation Repository
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives, Feb 2024, arxiv
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, Apr 2023, arxiv
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Oct 2021, arxiv
- Benchmark: LoTTE, Long-Tail Topic-stratified Evaluation for IR that features 12 domain-specific search tests, spanning StackExchange communities and using queries from GooAQ, ColBERT repository with the benchmark data
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM
- STARK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases, Apr 2024, arxiv, code github
- Constitutional AI: Harmlessness from AI Feedback, Sep 2022, arxiv (see Appendix B Identifying and Classifying Harmful Conversations, other parts)
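
BEIR, LoTTE, and the re-ranking studies above standardize on graded ranking metrics, most commonly nDCG@k. A minimal sketch of the usual formulation (log2 discount, ideal ranking obtained by sorting the gains):

```python
import math

def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevance scores."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_relevance: list[float], k: int = 10) -> float:
    """ranked_relevance: graded relevance of documents in the order the system returned them."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevance, k) / ideal_dcg

# Example: system ranked docs with relevance [3, 2, 3, 0, 1]; nDCG@5 is close to but below 1.
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 4))
```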
- Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications, Feb 2024, arxiv
- MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models, Dec 2024, openreview arxiv benchmark code and data at github
- A dataset and benchmark for hospital course summarization with adapted large language models, Dec 2024, Journal of the American Medical Informatics Association
- A framework for human evaluation of large language models in healthcare derived from literature review, September 2024, Nature Digital Medicine
- Evaluation and mitigation of cognitive biases in medical language models, Oct 2024 Nature
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI Feb 24, Nature
- Evaluating Generative AI Responses to Real-world Drug-Related Questions, June 2024, Psychiatry Research
- Clinical Insights: A Comprehensive Review of Language Models in Medicine, Aug 2024, arxiv See table 2 for evaluation
- Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data Jan 2024 arxiv
- Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis, Jan 2024, arxiv
- MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022, PMLR
- What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams, MedQA benchmark, Sep 2020, arxiv
- PubMedQA: A Dataset for Biomedical Research Question Answering, 2019, acl
- Open Medical LLM Leaderboard from HF Explanation
- Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics, Apr 2023, arxiv
- Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery, Apr 2023, pub med
- Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today, Jun 2023, arxiv
- Evaluating the use of large language model in identifying top research questions in gastroenterology, Mar 2023, nature
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- MedDialog: Two Large-scale Medical Dialogue Datasets, Apr 2020, arxiv
- An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, article html
- DrugBank 5.0: a major update to the DrugBank database for 2018, 2018, paper html
- A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models, May 2024, nature, dataset
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, Aug 2023, arxiv
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, NeurIPS 2023
- LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain, EMNLP 2023
- Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities NeurIPS 2022
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, 2022, arxiv
- What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks, NeurIPS 2023, NeurIPS 2023
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Nov 2023, arxiv
- MATH, Mathematics Aptitude Test of Heuristics, Measuring Mathematical Problem Solving With the MATH Dataset, Nov 2021, arxiv
- How well do large language models perform in arithmetic tasks?, Mar 2023, arxiv
- FrontierMath at EpochAI, FrontierAI page, FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, Nov 2024, arxiv
- Cmath: Can your language model pass chinese elementary school math test?, Jun 23, arxiv
- GSM8K, Papers with Code, github repository
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 24, arxiv
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance, Jun 2023, arxiv
- BloombergGPT: A Large Language Model for Finance (see Chapter 5 Evaluation), Mar 2023, arxiv
- FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Oct 2023, arxiv
- Understanding the Capabilities of Large Language Models for Automated Planning, May 2023, arxiv
- LLM/VLM Benchmarks by Aman Chadha
- Awesome LLMs Evaluation Papers, a list of papers mentioned in the Evaluating Large Language Models: A Comprehensive Survey, Nov 2023
@article{Lopatenko2024CompendiumLLMEvaluation,
title = {Compendium of LLM Evaluation methods},
author = {Lopatenko, Andrei},
year = {2024},
note = {\url{https://github.com/alopatenko/LLMEvaluation}}
}