Furong Huang
Associate Professor @ University of Maryland
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Year
2024
Type(s)
Conference proceedings
Author(s)
Wang, Xiyao, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Fuxiao Liu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, and Furong Huang.
Source
The 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
BibTeX
@inproceedings{wang-etal-2024-mementos,
    title = "Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences",
    author = "Wang, Xiyao and
      Zhou, Yuhang and
      Liu, Xiaoyu and
      Lu, Hongjin and
      Xu, Yuancheng and
      He, Feihong and
      Yoon, Jaehong and
      Lu, Taixi and
      Liu, Fuxiao and
      Bertasius, Gedas and
      Bansal, Mohit and
      Yao, Huaxiu and
      Huang, Furong",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.25",
    pages = "416--442",
    abstract = "Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs{'} sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs{'} sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations.",
}