MuirBench

A Comprehensive Benchmark for Robust Multi-image Understanding

🌐 Homepage | 🤗 Dataset | 📖 Paper | 💻 Evaluation

News

🔥 [2024-09-03] To ensure consistent evaluation, we released the preprocessing code for creating prompts and the postprocessing code for parsing predictions.
🔥 [2024-07-15] MuirBench is now on LMMs-Eval, enabling rapid evaluation of multimodal LLMs.
🔥 [2024-06-13] MuirBench is released.

Intro

MuirBench is a benchmark containing 11,264 images and 2,600 multiple-choice questions, providing robust evaluation on 12 multi-image understanding tasks.

MuirBench evaluates a comprehensive range of 12 multi-image understanding abilities, e.g. geographic understanding, diagram understanding, and visual retrieval, whereas prior benchmarks generally contain single-image questions.
MuirBench contains 10 diverse multi-image relations, e.g. narrative and complementary.
MuirBench provides a robust evaluation of models through unanswerable instance variants. The three major ways of creating unanswerable instances are shown below.

Results

Evaluated on 20 recent multimodal LLMs, our results reveal that even the best-performing models, GPT-4o and Gemini Pro, find it challenging to solve MuirBench, achieving 68.0% and 49.3% accuracy, respectively. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, staying below 33.3% accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, and they suggest potential pathways for future improvements.
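The accuracy figures above are multiple-choice accuracies. As a rough illustration only, the sketch below shows one way such a score could be computed from raw model responses. It is not the postprocessing code released in this repository: the record fields (`task`, `response`, `answer`, `num_options`) and both function names are hypothetical, and the letter-matching rule is a deliberately simple assumption.

```python
import re
from collections import defaultdict


def parse_choice(response, num_options):
    """Return the first standalone option letter (A, B, ...) found in a
    model response, or None if no valid letter appears.

    Assumption: options are labeled with consecutive uppercase letters.
    """
    valid = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:num_options]
    match = re.search(rf"\b([{valid}])\b", response.strip())
    return match.group(1) if match else None


def accuracy_by_task(records):
    """Compute per-task and overall accuracy.

    Each record is a dict with hypothetical keys:
    'task', 'response', 'answer', 'num_options'.
    Responses with no parseable letter count as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        predicted = parse_choice(record["response"], record["num_options"])
        if predicted == record["answer"]:
            correct[record["task"]] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall


if __name__ == "__main__":
    # Toy records for illustration; not taken from the benchmark.
    demo = [
        {"task": "visual retrieval", "response": "(B)", "answer": "B", "num_options": 4},
        {"task": "visual retrieval", "response": "The answer is C.", "answer": "A", "num_options": 4},
        {"task": "diagram understanding", "response": "D", "answer": "D", "num_options": 4},
    ]
    per_task, overall = accuracy_by_task(demo)
    print(per_task, overall)
```

For comparable numbers, use the released preprocessing and postprocessing code mentioned in the News section rather than an ad hoc parser like this one, since small differences in answer parsing can shift reported accuracy.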

Disclaimer

MuirBench incorporates data sourced from established image datasets. Every effort has been made to ensure that the data presented in this paper is utilized in compliance with relevant copyright laws and appropriately credited. Should any copyright holder identify an image in our work that they believe infringes upon their licensing agreements, we invite them to contact us directly. We are committed to addressing any legitimate concerns in a timely and responsible manner.

Contact

Fei Wang: [email protected]
Xingyu Fu: [email protected]

Citation

@article{wang2024muirbench,
  title={MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding},
  author={Wang, Fei and Fu, Xingyu and Huang, James Y and Li, Zekun and Liu, Qin and Liu, Xiaogeng and Ma, Mingyu Derek and Xu, Nan and Zhou, Wenxuan and Zhang, Kai and others},
  journal={arXiv preprint arXiv:2406.09411},
  year={2024}
}
