MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark (2402.04788v3)

Published 7 Feb 2024 in cs.CL, cs.AI, and cs.CV

Abstract: Multimodal LLMs (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: \url{https://mLLM-judge.github.io/}.

Summary

  • The paper introduces the MLLM-as-a-Judge benchmark to systematically assess multimodal LLMs as autonomous evaluators.
  • It evaluates key tasks including scoring evaluation, pair comparison, and batch ranking using 3,300 image-instruction pairs, revealing strong human alignment in pair comparisons.
  • The study highlights persistent challenges with biases, complex reasoning, and hallucinations, urging improvements for robust multimodal judgment capabilities.

Unveiling the Judging Capabilities of Multimodal LLMs

The integration of visual comprehension with linguistic processing through Multimodal LLMs (MLLMs) marks a significant stride towards artificial general intelligence. Against this backdrop, our paper introduces a benchmark named MLLM-as-a-Judge, designed to systematically evaluate how well MLLMs perform as autonomous evaluators across multimodal tasks. The benchmark rigorously scrutinizes MLLMs' ability to deliver judgments that mirror human preferences and discernment.

Benchmark Development and Key Findings

Our benchmark is built around three core tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. It comprises a curated selection of 3,300 image-instruction pairs drawn from a wide range of fields, including text captioning, math reasoning, and infographic interpretation. Using four prominent MLLMs (GPT-4V, Gemini, LLaVA, and CogVLM), we conduct an extensive evaluation of their judgment consistency, bias, and susceptibility to hallucination against human-labeled standards.
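To make the benchmark setup concrete, the following is a minimal Python sketch of how a benchmark item and the judging prompts for the three tasks might be represented. The data fields and prompt wording are illustrative assumptions, not the paper's actual data schema or prompt templates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeSample:
    """One benchmark item: an image, an instruction, and candidate responses."""
    image_path: str
    instruction: str
    responses: List[str]       # candidate answers produced by different MLLMs
    human_labels: List[float]  # human-annotated quality judgments (illustrative field)


def build_judge_prompt(sample: JudgeSample, task: str) -> str:
    """Assemble a judging prompt for one of the three benchmark tasks.

    The wording here is illustrative, not the paper's exact prompt templates.
    """
    header = f"Instruction: {sample.instruction}\n"
    answers = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(sample.responses))
    if task == "scoring":
        return header + answers + "\nRate each response on a 1-5 scale."
    if task == "pair":
        return header + answers + "\nWhich response is better, [1] or [2]?"
    if task == "ranking":
        return header + answers + "\nRank all responses from best to worst."
    raise ValueError(f"unknown task: {task}")
```

In practice, each prompt would be sent to the judge MLLM together with the associated image, and the returned judgment compared against the human labels.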

A notable observation from our paper is the substantial alignment of MLLM judgments with human preferences in Pair Comparison. However, significant discrepancies emerge in Scoring Evaluation and Batch Ranking, particularly in tasks requiring complex reasoning. These findings underline a crucial gap between MLLM-generated judgments and human expectations, highlighting where these models falter.
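One plausible way to quantify this alignment, sketched below, is to use a different agreement measure per task: correlation for Scoring Evaluation, agreement rate for Pair Comparison, and rank correlation for Batch Ranking. These metric choices are assumptions and may differ from those reported in the paper.

```python
from typing import Sequence

from scipy.stats import kendalltau, pearsonr


def scoring_agreement(judge_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Correlation between judge-assigned and human-assigned scores."""
    corr, _ = pearsonr(judge_scores, human_scores)
    return corr


def pair_agreement(judge_choices: Sequence[int], human_choices: Sequence[int]) -> float:
    """Fraction of pair comparisons where the judge picks the same winner as humans."""
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(judge_choices)


def ranking_agreement(judge_rank: Sequence[int], human_rank: Sequence[int]) -> float:
    """Rank correlation between the judge's and humans' ordering of one response batch."""
    tau, _ = kendalltau(judge_rank, human_rank)
    return tau
```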

Challenges and Implications

Our analysis further sheds light on persistent challenges faced by MLLMs. These include a propensity for biases (egocentric, position, and length biases) and a tendency to generate hallucinatory responses. Interestingly, applying Chain-of-Thought reasoning and integrating a vision expert system show potential for mitigating some of these biases.
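Position bias, for example, can be probed by presenting the same pair of responses in both orders and checking whether the judge's preference stays consistent. The sketch below assumes a hypothetical `judge_fn` wrapper around an MLLM API and illustrates the general idea only; it is not the paper's evaluation protocol.

```python
from typing import Callable


def position_bias_check(judge_fn: Callable[[str, str, str], str],
                        instruction: str, resp_a: str, resp_b: str) -> bool:
    """Probe position bias by asking the judge twice with the response order swapped.

    `judge_fn` is a hypothetical wrapper around an MLLM API that returns "1" or "2"
    for the preferred response. A position-consistent judge should prefer the same
    underlying response regardless of presentation order.
    """
    first = judge_fn(instruction, resp_a, resp_b)   # resp_a shown as option [1]
    second = judge_fn(instruction, resp_b, resp_a)  # resp_b shown as option [1]
    return (first == "1" and second == "2") or (first == "2" and second == "1")
```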

Importantly, our work presents two novel datasets: MLLM-AS-A-JUDGE-HQ, comprising responses highly aligned with human judgments, and MLLM-AS-A-JUDGE-HARD, featuring responses marked by inconsistencies and hallucinations. These datasets are envisioned as a rigorous testing ground for advancing MLLMs.
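A hedged sketch of how such a split might be carried out in practice is given below; the `agreement_fn` callable and the thresholds are placeholders for illustration, not the paper's actual selection criteria.

```python
from typing import Callable, List, Tuple


def split_hq_hard(samples: List[object],
                  agreement_fn: Callable[[object], float],
                  hq_threshold: float = 0.8,
                  hard_threshold: float = 0.3) -> Tuple[List[object], List[object]]:
    """Illustrative split of benchmark items into high-agreement and hard subsets.

    `agreement_fn` maps a sample to a judge-human agreement score in [0, 1];
    the thresholds are placeholders, not the paper's actual selection criteria.
    """
    hq: List[object] = []
    hard: List[object] = []
    for sample in samples:
        score = agreement_fn(sample)
        if score >= hq_threshold:
            hq.append(sample)       # highly aligned with human judgments
        elif score <= hard_threshold:
            hard.append(sample)     # inconsistent or hallucination-prone cases
    return hq, hard
```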

Contributions and Future Directions

By introducing the MLLM-AS-A-JUDGE benchmark, our research paves the way for systematic assessment of MLLMs' judging abilities in multimodal tasks. The discrepancies uncovered between MLLM judgments and human preferences raise critical questions about the need for greater accuracy, fairness, and interpretability in AI-driven evaluation.

In navigating the future landscape of MLLM research, it is imperative to address the identified limitations, biases, and hallucinations in order to move closer to MLLMs that can reliably perform judgment tasks across diverse modalities. Our benchmark and datasets support this ongoing effort, urging the AI community to seek solutions that bridge the gap between machine-generated judgments and human expectations.
