RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Published 1 Dec 2023 in cs.CL and cs.CV | (2312.00849v2)

Abstract: Multimodal LLMs (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. However, existing MLLMs prevalently suffer from serious hallucination problems, generating text that is not factually grounded in associated images. The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications. To address the challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Specifically, RLHF-V collects human preference in the form of segment-level corrections on hallucinations, and performs dense direct preference optimization over the human feedback. Comprehensive experiments on five benchmarks in both automatic and human evaluation show that, RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency. Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8%, outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs, and shows better robustness than GPT-4V in preventing hallucinations aroused from over-generalization. We open-source our code, model, and data at https://github.com/RLHF-V/RLHF-V.

Citations (115)

Summary

The paper demonstrates that fine-grained correctional human feedback reduces the base MLLM's hallucination rate by 34.8% using 1.4k annotated samples, and by 13.8% when the framework is applied to LLaVA, substantially improving model trustworthiness.
It compares RLHF-V with GPT-4V, revealing trade-offs between elaborate content generation and concentrated hallucination instances.
The study finds that distilling GPT-4V through visual instruction tuning increases object mentions by 1.8 times but also raises hallucination rates, highlighting the challenge of matching instruction data to model capability.

Analysis of RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

The paper presents RLHF-V, a framework designed to enhance the trustworthiness of multimodal LLMs (MLLMs) through behavior alignment from fine-grained correctional human feedback. The framework targets hallucination, in which an MLLM generates text that is not factually grounded in the associated image.
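The abstract describes two ingredients: segment-level corrections of hallucinated text, and "dense direct preference optimization" over those corrections. The sketch below shows one way such an objective could be written. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weight `gamma` on corrected tokens, the length normalization, and the value of `beta` are all hypothetical choices made here. The DPO loss itself follows the standard form, the negative log-sigmoid of a scaled log-ratio margin.

```python
import math
from typing import Sequence


def weighted_logprob(
    token_logps: Sequence[float],
    corrected_mask: Sequence[bool],
    gamma: float = 5.0,  # assumed weight for corrected tokens (hypothetical)
) -> float:
    """Sequence score that up-weights tokens inside corrected segments.

    Each token log-probability gets weight `gamma` if it lies in a segment
    the annotator rewrote, and weight 1 otherwise. The sum is divided by the
    total weight so long and short responses stay comparable.
    """
    if len(token_logps) != len(corrected_mask):
        raise ValueError("token_logps and corrected_mask must align")
    if not token_logps:
        raise ValueError("empty response")
    total, norm = 0.0, 0.0
    for logp, is_corrected in zip(token_logps, corrected_mask):
        weight = gamma if is_corrected else 1.0
        total += weight * logp
        norm += weight
    return total / norm


def dpo_loss(
    policy_preferred: float,
    policy_rejected: float,
    ref_preferred: float,
    ref_rejected: float,
    beta: float = 0.5,  # assumed temperature (hypothetical)
) -> float:
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair."""
    margin = beta * (
        (policy_preferred - ref_preferred) - (policy_rejected - ref_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: the human-corrected response is "preferred" and the original
# hallucinated response is "rejected". Masks mark the rewritten segment.
preferred = weighted_logprob([-0.2, -0.4, -0.3], [False, True, True])
rejected = weighted_logprob([-0.2, -1.5, -1.1], [False, True, True])
loss = dpo_loss(preferred, rejected, ref_preferred=-0.6, ref_rejected=-0.6)
```

The design point is that a correction usually changes only a few tokens, so a plain sequence-level score would be dominated by the text both responses share. Weighting the corrected segments makes the preference signal dense where the annotator actually intervened.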

Enhancements to LLaVA MLLM

The study applies the RLHF-V framework to LLaVA, a well-regarded MLLM. The result is a 13.8% reduction in hallucination occurrences. This figure is distinct from the 34.8% reduction that the abstract reports for the base MLLM, and it indicates that RLHF-V can improve reliability on more than one model.

Comparative Study: RLHF-V and GPT-4V

Further evaluation compares RLHF-V with GPT-4V. GPT-4V tends to produce more elaborate descriptions, which the analysis associates with its higher resolution and more advanced architecture. Although GPT-4V shows an overall hallucination rate that is 17.3% lower in the comprehensive ALL metrics, its hallucination instances are more concentrated, which points to a trade-off in fine-grained visual description.

Together, these results highlight the specific strengths of RLHF-V, in particular its resistance to over-generalization compared with GPT-4V. GPT-4V's tendency to elaborate extensively is a double-edged sword: elaborate output can lead to hallucinations when the instruction data demands more than a model's foundational capabilities support.

Implications of Visual Instruction Distillation

The study explores whether GPT-4V's capabilities can be distilled through visual instruction tuning. Distillation attempts with RLHF-V increased object mentions in responses by 1.8 times, but they also raised hallucination rates. This outcome is consistent with the hypothesis that instruction data more complex than the model can handle exacerbates inaccuracies, a problem of task-model alignment.
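The two quantities discussed above, object mentions per response and hallucination rate, can be made concrete with a small sketch. The summary does not specify how the paper's benchmarks compute them, so the definitions below are assumptions in the style of object-hallucination metrics: a mention counts as hallucinated if the object is absent from the image's ground-truth object set. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Set


def mean_object_mentions(mentions: Sequence[List[str]]) -> float:
    """Average number of object mentions per response."""
    return sum(len(m) for m in mentions) / len(mentions)


def response_hallucination_rate(
    mentions: Sequence[List[str]], ground_truth: Sequence[Set[str]]
) -> float:
    """Fraction of responses with at least one object not in the image."""
    bad = sum(
        any(obj not in truth for obj in resp)
        for resp, truth in zip(mentions, ground_truth)
    )
    return bad / len(mentions)


def mention_hallucination_rate(
    mentions: Sequence[List[str]], ground_truth: Sequence[Set[str]]
) -> float:
    """Fraction of all object mentions that are not in the image."""
    total = sum(len(resp) for resp in mentions)
    bad = sum(
        obj not in truth
        for resp, truth in zip(mentions, ground_truth)
        for obj in resp
    )
    return bad / total if total else 0.0


# Toy data (hypothetical): objects mentioned in two responses, and the
# objects actually present in the corresponding images.
mentions = [["dog", "frisbee"], ["cat", "sofa", "lamp"]]
ground_truth = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}]

avg_mentions = mean_object_mentions(mentions)
resp_rate = response_hallucination_rate(mentions, ground_truth)
mention_rate = mention_hallucination_rate(mentions, ground_truth)
```

Under these definitions a model that mentions more objects has more opportunities to hallucinate, so a rise in mentions and a rise in the response-level rate can occur together, as the distillation experiment reports.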

Qualitative Analysis and Model Comparisons

Qualitative assessments further cement RLHF-V's position as a model yielding reduced hallucinations in both short-form and long-form QA scenarios relative to open-source counterparts like LLaVA-RLHF and InstructBLIP. These insights are paired with thorough implementation details, outlining the training efficiency of RLHF-V and highlighting a broader applicability with relatively low computational demands.

Conclusion and Future Prospects

This research underscores the importance of aligning model capabilities with appropriate instruction data and feedback for improved behavioral adaptation in MLLMs. The results have both theoretical and practical implications, suggesting avenues for further optimization in automatic visual description tasks and trustworthiness in AI systems. Future developments may focus on refining the balance between model complexity and training data granularity and extending the framework to other large-scale vision-LLMs to validate its versatility and robustness.