Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences (2401.10529v2)
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning over static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this gap, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences of varying lengths. We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe the dynamic information in a given image sequence, often hallucinating or misrepresenting objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors affecting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.
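The GPT-4-assisted evaluation is only named in the abstract, so the sketch below is a rough illustration of how such a protocol can work, assuming GPT-4 extracts object/behavior keywords from a model's sequence description and the score is keyword-level F1 against human annotations. The prompt, model name, and helper functions are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch of a GPT-4-assisted keyword-matching evaluation
# (the exact Mementos pipeline may differ; the prompt and names below are assumptions).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_keywords(description: str) -> set[str]:
    """Ask GPT-4 to pull object/behavior keywords out of a generated description."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "List the object and behavior keywords in the following "
                "image-sequence description, comma-separated:\n" + description
            ),
        }],
    )
    return {kw.strip().lower() for kw in resp.choices[0].message.content.split(",")}


def keyword_f1(predicted: set[str], gold: set[str]) -> float:
    """Score extracted keywords against human-annotated ground truth."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # keywords that match the annotation
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0


# Usage: compare an MLLM's description of a sequence with annotated keywords.
gold_keywords = {"dog", "ball", "run", "catch"}
score = keyword_f1(extract_keywords("A dog runs after a ball and catches it."), gold_keywords)
print(f"keyword F1: {score:.2f}")
```

A real benchmark run would additionally need to handle synonyms and report object and behavior scores separately; this sketch only shows the keyword-F1 skeleton.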
References

- Combating the compounding-error problem with a multi-step model.
- Vision-language models as a source of rewards. arXiv preprint arXiv:2312.09187.
- AutoEval-Video: An automatic benchmark for assessing large vision-language models in open-ended video question answering. arXiv preprint arXiv:2311.14906.
- Hallucination detection: Robustly discerning reliable answers in large language models. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 245–255.
- Mitigating hallucination in visual language models with visual supervision. arXiv preprint arXiv:2311.16479.
- Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864.
- Holistic analysis of hallucination in GPT-4V(ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning.
- OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. arXiv preprint arXiv:2311.17911.
- Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709.
- Hallucination augmented contrastive learning for multimodal large language model. arXiv preprint arXiv:2312.06968.
- Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046.
- Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint arXiv:2311.16922.
- SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125.
- HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.
- Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer.
- HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566.
- Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565.
- Improved baselines with visual instruction tuning.
- C-disentanglement: Discovering causally-independent generative factors under an inductive bias of confounder. arXiv preprint arXiv:2310.17325.
- MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
- On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895.
- LIV: Language-image representations and rewards for robotic control. arXiv preprint arXiv:2306.00958.
- OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204.
- OpenAI. 2023a. GPT-4 technical report.
- OpenAI. 2023b. GPT-4V(ision) system card.
- Vision-language models are zero-shot reward models for reinforcement learning. arXiv preprint arXiv:2310.12921.
- Object hallucination in image captioning. arXiv preprint arXiv:1809.02156.
- RoboCLIP: One demonstration is enough to learn robot policies. In Thirty-seventh Conference on Neural Information Processing Systems.
- Gemini Team. 2023. Gemini: A family of highly capable multimodal models.
- Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126.
- Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites. arXiv preprint arXiv:2312.01701.
- CoPlanner: Plan to roll out conservatively but to explore optimistically for model-based RL.
- NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786.
- LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265.
- mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257.
- LAMM: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687.
- HalluciDoctor: Mitigating hallucinatory toxicity in visual instruction data. arXiv preprint arXiv:2311.13614.
- MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
- How language model hallucinations can snowball.
- Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
- Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839.
- MiniGPT-5: Interleaved vision-and-language generation via generative vokens.
- Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754.
- Scalable prompt generation for semi-supervised learning with language models. arXiv preprint arXiv:2302.09236.
- Explore spurious correlations at the concept level in language models for text classification. arXiv preprint arXiv:2311.08648.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.