
Abstract

Comprehending text-rich visual content is paramount for the practical application of Multimodal LLMs (MLLMs), since text-rich scenarios, characterized by extensive text embedded within images, are ubiquitous in the real world. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from these models. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus) and emphasize the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at https://github.com/AILab-CVC/SEED-Bench.

Figure: Overview of the 63 data types categorized into Charts, Maps, and Webs in SEED-Bench-2-Plus.

Overview

  • SEED-Bench-2-Plus is a benchmark designed to evaluate Multimodal LLMs (MLLMs) on their ability to process complex text-rich visual data such as charts, maps, and web pages.

  • The benchmark includes a comprehensive dataset of 2.3K multiple-choice questions spread across three major categories—charts, maps, and webs—to assess various aspects of visual comprehension in MLLMs.

  • Findings reveal significant disparities in MLLM performance, indicating challenges in handling text-rich visuals and suggesting the need for improved model design and specialized training for different real-world scenarios.

Comprehensive Evaluation of Multimodal LLMs on Text-Rich Visual Data via SEED-Bench-2-Plus

Introduction

Understanding text-rich visual data is a crucial capability for Multimodal LLMs (MLLMs). These models must decipher intricate visual content that contains extensive embedded text, such as charts, maps, and web pages, all of which reflect common real-world applications. To address this need, SEED-Bench-2-Plus was developed as a robust benchmark for comprehensively evaluating how well various MLLMs process such complex data scenarios.

Overview of SEED-Bench-2-Plus

SEED-Bench-2-Plus extends and enriches the scope of the previous benchmark version by introducing a large set of 2.3K multiple-choice questions that capture a broad range of text-rich visual comprehension challenges across three major categories:

  • Charts: Assessing the model's ability to interpret and extract information from various graphical representations.
  • Maps: Evaluating geographical and symbolic data comprehension within different map types.
  • Webs: Testing the capability to understand and extract information from various webpage layouts.

These categories are further broken down into 63 diverse types, providing a comprehensive framework to test and refine the capabilities of MLLMs in understanding highly textual visual information.
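
Concretely, each benchmark item is a multiple-choice question grounded in a single text-rich image. The sketch below shows one plausible way to represent such a sample; the field names and example values are hypothetical and may not match the released dataset's exact schema.

```python
# Hypothetical representation of a SEED-Bench-2-Plus style sample;
# field names and values are illustrative, not the released schema.
from dataclasses import dataclass
from typing import List

@dataclass
class TextRichSample:
    image_path: str       # screenshot of a chart, map, or web page
    category: str         # "Charts", "Maps", or "Webs"
    data_type: str        # one of the 63 fine-grained types
    question: str         # human-verified question about the image
    choices: List[str]    # candidate answers (multiple choice)
    answer_index: int     # index of the ground-truth choice

sample = TextRichSample(
    image_path="charts/example_001.png",   # hypothetical path
    category="Charts",
    data_type="flow chart",                # assumed example type
    question="Which step follows data collection in the pipeline?",
    choices=["Annotation", "Deployment", "Evaluation", "Pre-training"],
    answer_index=0,
)
```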

Data Collection and Evaluation Strategy

The benchmark utilizes rigorous methodologies for both data collection and evaluation:

  • Data Source: A combination of manual collection and automation tools (like GPT-4V for question generation) has been employed, ensuring a rich and diverse dataset.
  • Evaluation Strategy: Unlike some existing benchmarks, SEED-Bench-2-Plus adopts an answer-ranking strategy: for each question, the likelihood of the model generating each candidate choice is computed, and the highest-likelihood choice is taken as the model's prediction, which is then compared against the human-annotated ground truth. This design minimizes bias from prompt format and reduces reliance on the model's ability to follow specific answering patterns (a minimal sketch of the ranking step follows below).
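
To make the ranking step concrete, here is a minimal sketch of likelihood-based answer ranking for a causal language model using PyTorch. It covers only the text side (a real MLLM would also condition on the image), and the function, prompt handling, and scoring details are illustrative assumptions rather than the benchmark's official evaluation code.

```python
# Minimal sketch of likelihood-based answer ranking (text side only).
# Assumes a Hugging Face-style causal LM and tokenizer are passed in;
# this is NOT the official SEED-Bench-2-Plus evaluation code.
import torch

def rank_choices(model, tokenizer, question: str, choices: list[str]) -> int:
    """Return the index of the choice with the highest summed log-likelihood."""
    scores = []
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    for choice in choices:
        # Assumes tokenizing the concatenation leaves the question prefix intact.
        ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits                         # (1, seq_len, vocab)
        log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)  # predicts tokens 1..seq_len-1
        targets = ids[:, 1:]                                   # shifted next-token targets
        token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # (1, seq_len-1)
        scores.append(token_lp[0, prompt_len - 1:].sum().item())  # answer tokens only
    return int(torch.tensor(scores).argmax())
```

Accuracy is then simply the fraction of questions for which the top-ranked choice matches the human-annotated answer, aggregated per category or per data type.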

Findings from the Benchmark

A comprehensive evaluation involving 34 MLLMs was conducted, revealing varied performance across different models and data types. The assessment covers both open-source MLLMs and proprietary models such as GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus. The evaluation highlighted:

  • General Challenges in Text-Rich Scenarios: There is a significant disparity in model performance, indicating the ongoing challenge of processing complex multimodal and text-rich information.
  • Specific Struggles with Maps: Maps, which often contain layered information and require contextual comprehension, proved particularly challenging.
  • Performance Variability: There is substantial variability in how different models handle various data types, suggesting that current MLLMs might require further specialization or training to handle the intricacies of real-world text-rich scenarios effectively.

Implications for Future Research

The insights from SEED-Bench-2-Plus underscore several avenues for future work:

  • Model Improvement: There is a critical need for enhancing MLLM design and training approaches to better handle the complexity and variability of text-rich visual data.
  • Benchmark Refinement: Continual development of benchmarks like SEED-Bench-2-Plus is essential to push the boundaries of what MLLMs can understand and how effectively they can operate in real-world applications.
  • Community Collaboration: By making SEED-Bench-2-Plus publicly available and maintaining a leaderboard, the benchmark encourages ongoing community engagement and collaborative improvement in the multimodal AI field.

Conclusion

SEED-Bench-2-Plus offers a rigorous and detailed framework for evaluating the proficiency of MLLMs in understanding text-rich visuals, positioning itself as a critical tool for guiding future advancements in AI research and application. With its comprehensive design and robust evaluation methodology, it sets a new standard for assessing the capabilities of AI models in handling the complexities of real-world data.
