ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models (2401.13311v3)

Published 24 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Many real-world tasks require an agent to reason jointly over text and visual objects, (e.g., navigating in public spaces), which we refer to as context-sensitive text-rich visual reasoning. Specifically, these tasks require an understanding of the context in which the text interacts with visual elements within an image. However, there is a lack of existing datasets to benchmark the state-of-the-art multimodal models' capability on context-sensitive text-rich visual reasoning. In this paper, we introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision, LLaVA-Next) and establish a human performance baseline. Further, we perform human evaluations of the model responses and observe a significant performance gap of 30.8% between GPT-4V (the current best-performing Large Multimodal Model) and human performance. Our fine-grained analysis reveals that GPT-4V encounters difficulties interpreting time-related data and infographics. However, it demonstrates proficiency in comprehending abstract visual contexts such as memes and quotes. Finally, our qualitative analysis uncovers various factors contributing to poor performance including lack of precise visual perception and hallucinations. Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/

Citations (10)

Summary

  • The paper introduces a novel benchmark, ConTextual, that tests large multimodal models’ ability to reason based on text embedded in diverse visual contexts.
  • It assesses models across eight realistic scenarios including extractive, mathematical, and open-ended reasoning challenges.
  • Experiments show proprietary models like GPT-4V outperform open-source LMMs, yet still trail behind human performance by over 30%.

Evaluation of Context-Sensitive Text-Rich Visual Reasoning

Introduction

The advent of instruction-tuned large multimodal models (LMMs) has led to heightened capabilities in responding to human instructions over images. Recent datasets have focused on assessing models' Optical Character Recognition (OCR) ability, but this falls short of testing the full potential of LMMs to jointly reason over the text and visual context in an image. To bridge this gap, the paper introduces ConTextual, a benchmark designed to evaluate LMMs' ability to perform context-sensitive reasoning over diverse and challenging real-world scenarios.

ConTextual Dataset

ConTextual consists of 506 challenging instructions that test LMMs across eight visual scenarios representing everyday natural or digital scenes. The dataset demands joint reasoning over textual and visual cues, something prior datasets do not incentivize sufficiently. The instructions include open-ended questions and imperative tasks, requiring extractive capabilities as well as reasoning beyond information extraction, including mathematical reasoning. A sketch of how such a benchmark might be loaded and inspected follows.
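The snippet below is a minimal sketch of loading and inspecting a benchmark like ConTextual via the Hugging Face `datasets` library. The dataset identifier and field names ("category", "instruction", "image") are assumptions for illustration, not confirmed by the paper; consult the project page at https://con-textual.github.io/ for the actual release.

```python
# Minimal sketch: load the ConTextual benchmark and inspect its visual scenarios.
# Dataset identifier and column names below are assumed, not confirmed by the paper.
from collections import Counter

from datasets import load_dataset  # pip install datasets

# Hypothetical Hugging Face identifier for the released test split.
ds = load_dataset("ucla-contextual/contextual_test", split="test")

print(len(ds))  # expected: 506 human-crafted instructions

# Count samples per visual scenario (column name "category" is an assumption).
print(Counter(ds["category"]))  # expected: eight scenarios, e.g. time, shopping, memes

sample = ds[0]
print(sample["instruction"])  # open-ended question or imperative task
sample["image"].show()        # the text-rich image the instruction refers to
```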

Experimental Setup and Findings

A comprehensive set of experiments was conducted with 14 foundation models, including both proprietary LMMs (e.g., GPT-4V, Gemini-Pro-Vision) and open LMMs (e.g., LLaVA-1.5). The findings show GPT-4V(ision) outperforming the other LMMs, though it still lags behind human performance by 30.8%. Open LMMs show a notable performance disparity relative to proprietary models, pointing to a need for future advances that narrow this divide. A sketch of how that headline gap is computed follows.
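The sketch below illustrates the arithmetic behind the headline comparison: given binary human judgments of response acceptability, compute a per-model acceptance rate and its gap to the human baseline. The judgment vectors are illustrative placeholders, not the paper's per-sample data.

```python
# Sketch of the human-evaluation comparison reported in the paper.
# Judgment vectors here are placeholders (1 = response judged acceptable, 0 = not).
from statistics import mean

def acceptance_rate(judgments: list[int]) -> float:
    """Fraction of responses judged acceptable by human annotators."""
    return mean(judgments)

# Hypothetical judgments over the 506 instructions (truncated for illustration).
human_judgments = [1, 1, 0, 1, 1]
gpt4v_judgments = [1, 0, 0, 1, 0]

human_acc = acceptance_rate(human_judgments)
gpt4v_acc = acceptance_rate(gpt4v_judgments)

# The paper reports this gap as 30.8 percentage points for GPT-4V vs. humans.
gap = (human_acc - gpt4v_acc) * 100
print(f"GPT-4V trails the human baseline by {gap:.1f} points")
```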

Model Performance and Analysis

Qualitative analysis reveals a wide range of performance levels: GPT-4V and Gemini-Pro-Vision showcase superior context-sensitive text-rich visual reasoning, whereas open-source LMMs underperform considerably. The analysis also surfaces failure modes such as hallucination and a failure to ground instructions in the image. Interestingly, in certain abstract categories such as memes and quotes, GPT-4V exceeds human performance, indicating the potential for tuning LMMs toward better visual context understanding. Overall, ConTextual demonstrates how challenging context-sensitive text-rich visual reasoning remains for modern LMMs and the gap that persists between them and humans.
