Emergent Mind

Abstract

Multi-modal LLMs (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, a wide gap remains between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of that gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini along with six open-source LLMs and MLLMs. Overall, we evaluate 232 manually designed cases, whose qualitative results are summarized into 12 scores (i.e., 4 modalities × 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.

Overview

  • The paper evaluates the performance of Multi-Modal LLMs (MLLMs) such as GPT-4 and Gemini, along with several open-source models, across text, code, image, and video modalities, focusing on generalizability, trustworthiness, and causal reasoning.

  • The study uses 232 manually designed cases to analyze the models, revealing that GPT-4 excels in logical reasoning, commonsense reasoning, mathematical problem-solving, and safety, while Gemini shows superior performance in multilingual translation tasks but falls short in mathematical and reasoning tasks.

  • The findings emphasize the importance of improving MLLMs' trustworthiness and causal reasoning, particularly for practical applications in sensitive fields such as healthcare, and advocate for the development and adoption of enhanced evaluation metrics for multimodal inputs.

Assessing Multi-Modal LLMs: Generalizability, Trustworthiness, and Causality

The evaluation of Multi-Modal LLMs (MLLMs) is complex due to their diverse functionality across multiple input types—text, code, image, and video. The study titled "From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities" offers a detailed examination, focusing on proprietary models like GPT-4 and Gemini, as well as various open-source models. The research aims to characterize the gap between the performance of MLLMs and public expectations through qualitative assessment.

Methodology

The authors evaluate the models using 232 manually designed cases spanning text, code, image, and video modalities. Performance is analyzed across three core properties: generalizability, trustworthiness, and causal reasoning. The models tested include the closed-source GPT-4 and Gemini, along with six open-source alternatives. The paper ultimately runs the models through a spectrum of qualitative tests summarized into 12 scores, uncovering 14 empirical findings regarding their competencies and limitations.
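The aggregation scheme described above—one score per (modality, property) cell, giving 4 × 3 = 12 scores—can be sketched as follows. This is an illustrative reconstruction, not the paper's actual scoring code; the case tuples, pass/fail encoding, and `summarize` helper are all assumptions for the sake of the example.

```python
from collections import defaultdict

# The paper's four modalities and three properties (12 cells in total).
MODALITIES = ["text", "code", "image", "video"]
PROPERTIES = ["generalizability", "trustworthiness", "causal_reasoning"]

def summarize(cases):
    """Aggregate per-case pass/fail results into one score per
    (modality, property) cell -- 4 modalities x 3 properties = 12 scores."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [num_passed, num_total]
    for modality, prop, passed in cases:
        cell = totals[(modality, prop)]
        cell[0] += int(passed)
        cell[1] += 1
    # Score for each cell is the fraction of its cases the model handled well.
    return {cell: passed / total for cell, (passed, total) in totals.items()}

# Illustrative usage with made-up case outcomes:
cases = [
    ("text", "generalizability", True),
    ("text", "generalizability", False),
    ("image", "causal_reasoning", True),
]
scores = summarize(cases)
# scores[("text", "generalizability")] == 0.5
```

In the actual study the per-case judgments are qualitative rather than strict pass/fail, so a real tally would likely use graded ratings, but the cell structure is the same.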

Key Findings

Textual and Coding Capabilities

GPT-4 outperforms Gemini and all open-source models in overall text and coding abilities. For instance, in logical and commonsense reasoning tasks, GPT-4's accuracy and robustness are notably higher. Gemini, while providing more nuanced translations and demonstrating strong performance in multilingual settings, falls short in mathematical reasoning and domain-specific knowledge.

  • GPT-4: Superior in logical reasoning, commonsense reasoning, and mathematical problem solving, also displaying high competency in text and code trustworthiness and safety.
  • Gemini: Outperforms GPT-4 in some translation tasks and multilingual capabilities but struggles with mathematical and reasoning tasks.

Multilingual Capabilities

Gemini's capacity to understand and translate idioms and complex sentences from languages like Chinese surpasses that of GPT-4 and open-source models. The paper highlights Gemini's elegant translation of Chinese idioms, which often pose significant challenges due to cultural nuances.

Domain Knowledge

In specialized fields like medicine and economics, GPT-4 consistently provides accurate and detailed responses. For example, in medical diagnostic scenarios, GPT-4 correctly identifies conditions and gives relevant advice more accurately than Gemini or any open-source model.

Trustworthiness Assessment

The paper’s trustworthiness evaluation encompasses safety, reliability, robustness, morality, data protection, fairness, and legality.

  • Safety: GPT-4 demonstrates superior safety performance, effectively identifying and avoiding toxic content and extreme risks. By contrast, Gemini Pro fails to consistently recognize such risks.
  • Reliability and Robustness: GPT-4 excels in producing reliable, accurate content and shows fewer instances of hallucinations compared to its peers.
  • Ethical Compliance: Both in testing for potentially harmful advice and identifying content that does not adhere to social norms, GPT-4 outperforms Gemini and the open-source models.

Causal Reasoning

This property evaluates the ability of MLLMs to understand and apply cause-effect relationships across different modalities. Here, GPT-4 again shows stronger performance, particularly in understanding complex scenarios in video content. However, all tested models struggle with tasks requiring nuanced causal reasoning, such as predicting the outcome of dynamic interactions in videos.

Practical and Theoretical Implications

The findings underline the critical need for advancing MLLM trustworthiness and causal reasoning to meet the expectations of real-world applications. For practical uses, particularly those involving sensitive domains like healthcare, the consistency and reliability demonstrated by GPT-4 are essential. By contrast, the performance gaps highlighted in Gemini and in the open-source models suggest they need further development before deployment in high-stakes environments.

Future Directions

The study advocates the continuous refinement of evaluation metrics tailored to multimodal inputs and encourages a broader adoption of open-source models to mitigate over-reliance on proprietary systems. Additionally, the gap in identifying and handling extreme risks and ethical breaches in models like Gemini signifies an area for focused improvement.

Conclusion

The research presents a thorough comparative analysis of leading MLLMs, emphasizing the nuanced strengths and evident weaknesses of proprietary and open-source models. GPT-4 stands out for its reliability and robustness, whereas Gemini excels in specific multilingual tasks but needs enhancement in trustworthiness and causal reasoning. These insights are crucial for guiding the development of more dependable and ethically aligned MLLMs, ensuring their readiness for broader, impactful use.
