
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

(2402.04788)
Published Feb 7, 2024 in cs.CL, cs.AI, and cs.CV

Abstract

Multimodal LLMs (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Inspired by LLM-as-a-Judge in LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs to assist judges across three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking tasks. Furthermore, MLLMs still face challenges in judgment, including diverse biases, hallucinatory responses, and inconsistencies, even for advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts regarding MLLMs as fully reliable evaluators. Code and dataset are available at https://github.com/Dongping-Chen/MLLM-as-a-Judge.

Overview

  • The paper introduces a benchmark named MLLM-as-a-Judge for evaluating the performance of Multimodal LLMs (MLLMs) in autonomous evaluation tasks across various multimodal scenarios.

  • It evaluates four prominent MLLMs (GPT-4V, Gemini, LLaVA, and CogVLM), assessing their judgment consistency, susceptibility to bias, and tendency to hallucinate on a curated set of 3,300 image-instruction pairs.

  • The study finds substantial alignment between MLLM judgments and human preferences in Pair Comparisons, but significant discrepancies in Scoring Evaluation and Batch Ranking, especially in complex reasoning tasks.

  • The paper highlights the challenges MLLMs face, such as biases and hallucinations, and introduces two novel datasets aimed at improving MLLMs' performance and reliability in judgment tasks.

Unveiling the Judging Capabilities of Multimodal LLMs

The integration of visual comprehension with linguistic processing through Multimodal LLMs (MLLMs) marks a significant stride toward artificial general intelligence. Against this backdrop, our paper introduces a pioneering benchmark named MLLM-as-a-Judge, designed to systematically evaluate how effectively MLLMs perform as autonomous evaluators across various multimodal tasks. The benchmark rigorously scrutinizes MLLMs' ability to offer judgments that mirror human preferences and discernment.

Benchmark Development and Key Findings

Our benchmark is built around three core tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. It encompasses a meticulously curated selection of 3,300 image-instruction pairs derived from a wide range of fields including text captioning, math reasoning, and infographic interpretation. Utilizing four prominent MLLMs — GPT-4V, Gemini, LLaVA, and CogVLM — we embark on an extensive evaluation to gauge their judgment consistency, bias inclination, and susceptibility to hallucinations against human-labeled standards.
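
To make the three judging formats concrete, the sketch below shows one way these tasks could be posed to an MLLM judge. The prompt wording and the generic `Judge` callable are illustrative assumptions, not the benchmark's actual templates or API.

```python
# Minimal sketch of the three judging formats, assuming a generic
# judge(image_path, prompt) callable that wraps whichever MLLM
# (GPT-4V, Gemini, LLaVA, CogVLM) is acting as the judge.
from typing import Callable, List

Judge = Callable[[str, str], str]  # (image_path, prompt) -> judge reply

def scoring_evaluation(judge: Judge, image: str, instruction: str, response: str) -> str:
    """Scoring Evaluation: rate a single response on a fixed scale."""
    prompt = (
        f"Instruction: {instruction}\nResponse: {response}\n"
        "Considering the image, rate the response from 1 (poor) to 5 (excellent). "
        "Reply with the score only."
    )
    return judge(image, prompt)

def pair_comparison(judge: Judge, image: str, instruction: str, a: str, b: str) -> str:
    """Pair Comparison: pick the better of two responses."""
    prompt = (
        f"Instruction: {instruction}\nResponse A: {a}\nResponse B: {b}\n"
        "Considering the image, which response is better? Reply with 'A', 'B', or 'Tie'."
    )
    return judge(image, prompt)

def batch_ranking(judge: Judge, image: str, instruction: str, responses: List[str]) -> str:
    """Batch Ranking: order several responses from best to worst."""
    listed = "\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    prompt = (
        f"Instruction: {instruction}\n{listed}\n"
        "Considering the image, rank the responses from best to worst, e.g. '2 > 1 > 3'."
    )
    return judge(image, prompt)
```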

A notable observation from our study is the substantial alignment of MLLM judgments with human preferences in the realm of Pair Comparisons. However, significant discrepancies emerge in Scoring Evaluation and Batch Ranking tasks, particularly in areas necessitating complex reasoning. These findings underline a crucial disparity between MLLM-generated judgments and human expectations, highlighting areas where these models falter.
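
As a rough illustration of how such alignment can be quantified, the snippet below computes a Pearson correlation for Scoring Evaluation and a simple agreement rate for Pair Comparison against human labels. These are common choices for this kind of comparison; the paper's exact metrics and post-processing may differ.

```python
# Toy illustration of per-task agreement with human annotations.
from math import sqrt
from typing import Sequence

def pearson(model: Sequence[float], human: Sequence[float]) -> float:
    """Pearson correlation between judge scores and human scores (Scoring Evaluation)."""
    n = len(model)
    mx, my = sum(model) / n, sum(human) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(model, human))
    sx = sqrt(sum((x - mx) ** 2 for x in model))
    sy = sqrt(sum((y - my) ** 2 for y in human))
    return cov / (sx * sy)

def agreement(model: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of Pair Comparison verdicts that match the human choice."""
    return sum(m == h for m, h in zip(model, human)) / len(model)

# Example with toy labels:
print(pearson([4, 3, 5, 2], [5, 3, 4, 2]))             # 0.80: scores track human ratings
print(agreement(["A", "B", "Tie"], ["A", "A", "Tie"]))  # 0.67: two of three verdicts match
```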

Challenges and Implications

Our analysis further sheds light on persistent challenges faced by MLLMs, including a propensity for egocentric, position, and length biases, as well as a tendency to generate hallucinatory responses. Notably, applying Chain-of-Thought reasoning and integrating a vision expert system shows potential for mitigating some of these biases.
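
As one concrete, hypothetical example of probing position bias, the sketch below re-runs a Pair Comparison with the response order swapped and checks whether the verdict stays consistent. It reuses the illustrative `pair_comparison` helper from the earlier sketch; the paper's actual bias analysis may be more involved.

```python
# Swap-order probe for position bias in Pair Comparison (illustrative only).
def position_bias_probe(judge, image, instruction, resp_a, resp_b) -> bool:
    """Query the judge twice with the candidates swapped; a position-robust
    judge should prefer the same underlying response both times."""
    first = pair_comparison(judge, image, instruction, resp_a, resp_b).strip()
    second = pair_comparison(judge, image, instruction, resp_b, resp_a).strip()
    # After the swap, 'A' refers to resp_b and 'B' to resp_a, so map it back.
    remapped = {"A": "B", "B": "A", "Tie": "Tie"}.get(second, second)
    return first == remapped  # False suggests the verdict depends on ordering

# A Chain-of-Thought variant simply asks the judge to reason before answering,
# e.g. by appending "Explain your reasoning step by step, then give your verdict."
```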

Importantly, our work presents two novel datasets: MLLM-as-a-Judge-HQ, comprising responses highly aligned with human judgments, and MLLM-as-a-Judge-Hard, featuring responses marked by inconsistencies and hallucinations. These datasets are envisioned as a rigorous testing ground for advancing MLLMs.
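
For readers who want to experiment with the released data, the dataclass below is a purely hypothetical sketch of what a judging sample might contain; the actual fields and file formats are defined in the GitHub repository linked above.

```python
# Hypothetical record layout for a judging sample (illustration only; see the
# repository for the datasets' real schema).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class JudgeSample:
    image_path: str                        # image shown to the judge
    instruction: str                       # instruction paired with the image
    responses: List[str]                   # candidate responses to be judged
    human_judgment: str                    # human score, choice, or ranking
    mllm_judgment: Optional[str] = None    # judgment produced by the MLLM judge
```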

Contributions and Future Directions

By introducing the MLLM-as-a-Judge benchmark, our research paves the way for a systematic assessment of MLLMs' judging abilities in multimodal tasks. The discrepancies uncovered between MLLM judgments and human preferences raise critical questions about the need for greater algorithmic accuracy, fairness, and interpretability in AI evaluations.

In navigating the future landscape of MLLM research, it is imperative to address the identified limitations, biases, and hallucinations so as to move closer to MLLMs that can reliably perform judgment tasks across diverse modalities. Our benchmark and datasets provide a foundation for this effort, urging the AI community to pursue solutions that bridge the gap between machine-generated judgments and human expectations.
