Abstract

MM-Vet, with open-ended vision-language questions targeting the evaluation of integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding", evaluating models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while further expanding the evaluation set size. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.

Figure: Four examples from MM-Vet v2 showcasing improvements over the original MM-Vet dataset.

Overview

  • The paper introduces MM-Vet v2, an enhanced benchmark for evaluating Large Multimodal Models (LMMs) by supplementing the original MM-Vet with new evaluation dimensions and an expanded dataset.

  • MM-Vet v2 encompasses 517 evaluation samples, with an added focus on 'image-text sequence understanding,' aiming to reflect the integrative and sequential nature of real-world multimodal interactions.

  • The benchmark was used to evaluate a range of advanced LMMs, with Claude 3.5 Sonnet and GPT-4o achieving the highest scores, helping to identify model strengths and weaknesses in complex scenarios.

MM-Vet v2: A Comprehensive Benchmark for Evaluating Large Multimodal Models

The paper titled "MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities" introduces an enhanced version of the MM-Vet benchmark, designated as MM-Vet v2, aimed at assessing the integrated capabilities of Large Multimodal Models (LMMs). Developed by researchers from the National University of Singapore, Microsoft, and Advanced Micro Devices, MM-Vet v2 supplements the original MM-Vet with new evaluation dimensions and an expanded dataset.

Introduction and Motivation

Large multimodal models (LMMs) have shown remarkable capabilities in processing complex tasks that involve multiple integrated vision-language (VL) functionalities. Existing benchmarks, notably the original MM-Vet, have facilitated the evaluation of such models by focusing on six core VL capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, these benchmarks primarily focus on single image-text pair evaluations, thus failing to reflect the integrative and sequential nature of real-world multimodal interactions. MM-Vet v2 addresses this shortfall by introducing an additional capability, termed "image-text sequence understanding," which evaluates a model's ability to comprehend interleaved image and text sequences.

Dataset and Methodology

MM-Vet v2 encompasses 517 high-quality evaluation samples, a substantial increase from the 218 samples in the original MM-Vet. The dataset is carefully curated to cover a broad spectrum of scenarios, from daily-life contexts to specialized industry applications. Each sample is designed to challenge one or more of the core VL capabilities, now extended to include the processing of sequential image-text data. The researchers authored the questions themselves to ensure complexity and relevance, and ground-truth answers were produced through a multi-step process involving both language models and expert review.
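To make the interleaved format concrete, the sketch below shows one way such a sample could be represented in code. The field names, the example question, and the capability tags are illustrative assumptions, not the schema of the released dataset.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Segment:
    kind: str    # "text" or "image"
    value: str   # the text itself, or a path to the image file


@dataclass
class MMVetV2Sample:
    question_id: str
    segments: list[Segment]   # interleaved image-text sequence, in order
    capabilities: set[str] = field(default_factory=set)
    ground_truth: str = ""


# Illustrative sample: a question that refers to two images in sequence.
# Question text, image paths, and answer are invented for this sketch.
sample = MMVetV2Sample(
    question_id="v2_example_001",
    segments=[
        Segment("text", "Which item on the menu in"),
        Segment("image", "images/menu.jpg"),
        Segment("text", "matches the dish shown in"),
        Segment("image", "images/dish.jpg"),
        Segment("text", "?"),
    ],
    capabilities={"recognition", "ocr", "image-text sequence understanding"},
    ground_truth="The grilled salmon.",
)
```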

The evaluation framework retains the scoring approach of MM-Vet, using GPT-4 to score the quality of model responses. Because GPT-4's outputs can vary even at a fixed temperature setting, each model's responses were scored multiple times to ensure consistency. Performance metrics across capabilities are reported as average scores.
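As a rough sketch of this score-several-times-and-average scheme, the snippet below assumes a hypothetical grade_with_llm judge that returns a score in [0, 1]; the helper names, the default of five scoring runs, and the per-capability aggregation are assumptions rather than the official MM-Vet v2 scoring code.

```python
from __future__ import annotations
from statistics import mean


def grade_with_llm(question: str, ground_truth: str, answer: str) -> float:
    """Placeholder for a call to an LLM judge (e.g. GPT-4) that returns a
    score in [0, 1] for how well the answer matches the ground truth."""
    raise NotImplementedError  # hypothetical helper, not an official API


def score_response(question: str, ground_truth: str, answer: str,
                   num_runs: int = 5) -> float:
    """Query the judge several times and average, smoothing the run-to-run
    variability in LLM-judge outputs described above."""
    return mean(grade_with_llm(question, ground_truth, answer)
                for _ in range(num_runs))


def capability_scores(results: list[dict]) -> dict[str, float]:
    """Average per-sample scores within each capability tag; each result is
    assumed to hold {'score': float in [0, 1], 'capabilities': set[str]}."""
    by_cap: dict[str, list[float]] = {}
    for r in results:
        for cap in r["capabilities"]:
            by_cap.setdefault(cap, []).append(r["score"])
    # Report per-capability averages on a 0-100 scale.
    return {cap: 100 * mean(vals) for cap, vals in by_cap.items()}
```

Reporting the per-capability means on a 0-100 scale matches the convention of the published scores (e.g. 71.8), though the exact aggregation details here are an assumption.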

Experimental Results

MM-Vet v2 was employed to benchmark a range of advanced LMMs, including both open-weight models and closed-source systems. Claude 3.5 Sonnet and GPT-4o achieved the highest overall scores, 71.8 and 71.0 respectively, measured across all capability categories. Claude 3.5 Sonnet demonstrated particular strength in recognition, language generation, OCR, spatial awareness, and knowledge, while GPT-4o excelled in image-text sequence understanding and mathematical reasoning.

Among open-weight models, InternVL2-Llama3-76B emerged as a noteworthy performer with a score of 68.4, showcasing competitive results across multiple VL capabilities.

Implications

The introduction of MM-Vet v2 represents a significant advancement in the evaluation of LMMs. By incorporating the "image-text sequence understanding" capability, the benchmark better reflects the multifaceted and sequential nature of real-world multimodal interactions, thus offering a more holistic assessment of model performance.

From a practical standpoint, MM-Vet v2 addresses the evolving landscape of LMM applications, which require integrative and sequential processing abilities. The benchmark helps identify model strengths and weaknesses in more complex problem scenarios, potentially guiding future development. The rigorous evaluation measures employed in MM-Vet v2 raise the standard for benchmarking in multimodal AI research.

Future Developments

Future developments in AI, particularly within multimodal domains, could further benefit from extending the benchmark to more diverse and intricate evaluations. Potential enhancements could include real-time interactive scenarios and dynamic dataset expansions to continuously challenge emerging LMM capabilities. Moreover, future versions of the benchmark might explore automatic ground truth generation to scale up the availability of high-quality evaluation samples.

MM-Vet v2 stands as a robust tool in the evaluation arsenal, reinforcing the importance of comprehensive and realistic benchmarking methodologies in the progressive mapping of multimodal model capabilities and their practical applications.
