ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

(2401.13311)
Published Jan 24, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping, and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities under human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide a qualitative analysis, offering a robust framework for future advancements in LMM design. https://con-textual.github.io/

Overview

  • The paper introduces a new benchmark called ConTextual designed to test large multimodal models (LMMs) on context-sensitive text-rich visual reasoning over real-world images.

  • ConTextual comprises 506 diverse instructions across eight visual scenarios requiring both extractive and higher-order reasoning capabilities, including mathematical reasoning.

  • Experiments conducted with 13 foundation models, such as GPT-4V and Gemini-Pro-Vision, reveal that LMMs still lag significantly behind human performance, with a 30.8% gap.

  • GPT-4V and Gemini-Pro-Vision outperform the other models tested, while open-source LMMs underperform, indicating a performance gap between proprietary and open models.

  • The analysis highlights specific challenges in model reasoning, such as hallucination and improper grounding to images, although some LMMs occasionally outperform humans in interpreting abstract visual content like memes.

Evaluation of Context-Sensitive Text-Rich Visual Reasoning

Introduction

The advent of instruction-tuned large multimodal models (LMMs) has led to heightened capabilities in responding to human instructions over images. Recent datasets have focused on assessing the Optical Character Recognition (OCR) ability of models, but this falls short of testing the full potential of LMMs to jointly reason over the text and visual context in an image. To bridge this gap, the paper introduces the benchmark ConTextual, designed to evaluate LMMs' ability to perform context-sensitive reasoning over diverse and challenging real-world scenarios.

ConTextual Dataset

ConTextual consists of 506 challenging instructions testing LMMs across eight visual scenarios representing daily-life natural or digital scenes. The dataset demands joint reasoning over textual and visual cues, something prior datasets do not sufficiently incentivize. The instructions include open-ended questions and imperative tasks, requiring both extractive capabilities and reasoning beyond information extraction, including mathematical reasoning.
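
To make the dataset's structure concrete, here is a minimal sketch of what a single ConTextual-style instance might look like: an image containing embedded text, one of the eight visual scenario labels, an instruction, and a human reference response. The field names and the example content below are illustrative assumptions, not the benchmark's actual schema; see https://con-textual.github.io/ for the released format.

```python
# Illustrative sketch of a ConTextual-style instance.
# Field names and example values are hypothetical, for illustration only.

from dataclasses import dataclass


@dataclass
class ContextualInstance:
    image_path: str          # real-world or digital scene containing embedded text
    visual_scenario: str     # one of the eight categories, e.g. "time", "navigation"
    instruction: str         # open-ended question or imperative task
    reference_response: str  # human-written answer used during evaluation


example = ContextualInstance(
    image_path="images/bus_schedule.jpg",  # hypothetical file
    visual_scenario="navigation",
    instruction=(
        "Using the posted schedule, which bus should I take to arrive "
        "downtown before 9 AM?"
    ),
    reference_response="The 8:15 AM Route 42 bus arrives downtown at 8:50 AM.",
)

print(example.visual_scenario, "->", example.instruction)
```

Answering such an instruction requires reading the text in the image (the schedule) and relating it to the visual layout and the user's goal, which is precisely the context-sensitive reasoning the benchmark targets.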

Experimental Setup and Findings

A comprehensive set of experiments was conducted with 13 foundation models, including both proprietary LMMs (e.g., GPT-4V, Gemini-Pro-Vision) and open ones (e.g., LLaVA-1.5). The findings indicate that GPT-4V(ision) outperforms all other LMMs, although it still lags behind human performance by 30.8%. Open LMMs show a notable performance disparity compared to proprietary models, pointing to a need for future advancements that narrow this divide.
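
Alongside human evaluation, the paper reports an automatic metric that uses GPT-4 as a judge of response acceptability. The sketch below shows one way such a judge could be wired up with the OpenAI chat API; the judging prompt, the yes/no acceptance criterion, and the helper name `judge_response` are assumptions for illustration, not the authors' exact rubric.

```python
# Minimal sketch of GPT-4-as-judge automatic evaluation, in the spirit of the
# paper's automatic metric. Prompt wording and acceptance rule are assumptions.

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_response(instruction: str, reference: str, model_response: str) -> bool:
    """Ask GPT-4 whether an LMM's response is acceptable given the
    instruction and a human reference answer; return True if judged correct."""
    prompt = (
        "You are evaluating an AI assistant's answer to an instruction about "
        "an image that contains text.\n"
        f"Instruction: {instruction}\n"
        f"Reference answer: {reference}\n"
        f"Assistant answer: {model_response}\n"
        "Reply with 'yes' if the assistant answer is correct and consistent "
        "with the reference, otherwise reply 'no'."
    )
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = result.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")
```

Averaging this binary verdict over the 506 instructions yields an acceptance rate per model, which is how an automatic score of this kind can be compared against the human-evaluation numbers.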

Model Performance and Analysis

Qualitative analysis reveals a range of performance levels, with GPT-4V and Gemini-Pro-Vision showcasing superior context-sensitive text-rich visual reasoning, whereas open-source LMMs underperform considerably. The analysis further helps identify issues such as hallucination and failure to ground responses in the image. Interestingly, in certain abstract categories like memes and quotes, GPT-4V exceeds human performance, indicating the potential for tuning LMMs toward better visual context understanding. Overall, ConTextual demonstrates the challenging nature of context-sensitive text-rich visual reasoning and the gap that remains in modern LMMs.
