mDPO: Conditional Preference Optimization for Multimodal Large Language Models

(2406.11839)
Published Jun 17, 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

Direct preference optimization (DPO) has been shown to be an effective method for LLM alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.

A look at the issues with standard DPO in multimodal settings and a proposed solution built on an additional image-preference objective and a reward anchor.

Overview

  • The paper addresses the challenges of applying Direct Preference Optimization (DPO) to multimodal LLMs, particularly the neglect of visual data in favor of text.

  • It proposes two novel enhancements to DPO: Conditional Preference Optimization, which enforces the importance of visual data, and Anchored Preference Optimization, which maintains high-quality responses.

  • Experiments show that these methods significantly improve performance, reduce hallucination rates, and are robust across various models and benchmarks.

Conditional Preference Optimization for Multimodal LLMs

The paper "Conditional Preference Optimization for Multimodal LLMs" addresses the challenges associated with applying Direct Preference Optimization (DPO) to multimodal LLMs. While DPO has been effective for aligning LLMs with human preferences in single-modal (mainly text-based) scenarios, its extension to multimodal contexts has faced non-trivial issues. These issues particularly include the model's failure to adequately incorporate visual information, resulting in an overemphasis on textual data.

Key Contributions

The paper identifies and seeks to mitigate the unconditional preference problem, where multimodal models trained with standard DPO tend to neglect the visual modality. To this end, the authors propose an enhanced DPO approach tailored to multimodal LLMs, denoted mDPO. The new method integrates two novel components into the DPO framework: Conditional Preference Optimization and Anchored Preference Optimization.

Conditional Preference Optimization

One of the paper's key findings is that multimodal LLMs often perform similarly even when visual information is omitted during preference optimization, which suggests that these models frequently disregard visual inputs and rely primarily on textual data. The core of the proposed solution is to construct preference pairs in which the image is varied, enforcing the visual modality's importance in preference decisions. Specifically, the approach builds a rejected image m_l by manipulating the chosen image m_w (e.g., random cropping) so that it retains less visual information, making it a harder negative sample. This strategy pushes the model to ground its preferences in visual cues, optimizing for conditional preferences.
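
A minimal sketch of how such image pairs and the image-side preference term might look in PyTorch. The names (`make_rejected_image`, `image_preference_loss`), the `keep_ratio` and `beta` values, and the use of precomputed response log-probabilities as inputs are illustrative assumptions rather than the paper's exact implementation; random cropping is the example corruption mentioned in the summary.

```python
import torch.nn.functional as F
from torchvision import transforms

def make_rejected_image(image, keep_ratio=0.3):
    """Build a harder negative image (m_l) from the chosen image (m_w)
    by keeping only a small random crop, removing most visual evidence."""
    _, h, w = image.shape  # assumes a [C, H, W] tensor
    crop = transforms.RandomCrop((max(1, int(h * keep_ratio)),
                                  max(1, int(w * keep_ratio))))
    return crop(image)

def image_preference_loss(logp_chosen_img_w, logp_chosen_img_l,
                          ref_logp_chosen_img_w, ref_logp_chosen_img_l,
                          beta=0.1):
    """Conditional (image-side) preference term: the same chosen response
    should score higher when conditioned on the original image m_w than on
    the corrupted image m_l. Inputs are summed response log-probabilities
    under each image condition."""
    r_w = beta * (logp_chosen_img_w - ref_logp_chosen_img_w)
    r_l = beta * (logp_chosen_img_l - ref_logp_chosen_img_l)
    return -F.logsigmoid(r_w - r_l).mean()
```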

Anchored Preference Optimization

The paper also highlights an intrinsic issue with standard DPO: the likelihood of preferred responses can decrease during training, which is counterproductive because these responses are of high quality. To counteract this, the authors introduce anchored preference optimization, which enforces a positive reward for chosen responses. By anchoring the optimization, the method avoids scenarios in which the model widens the likelihood gap simply by degrading the probability of good responses.
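
A sketch of the anchoring idea under the same assumptions as above (illustrative names and precomputed log-probabilities). The zero-valued anchor and the equal-weight combination shown in the closing comment are a plausible reading of the description here, not a verbatim reproduction of the paper's objective.

```python
import torch.nn.functional as F

def anchored_preference_loss(logp_chosen, ref_logp_chosen,
                             beta=0.1, anchor=0.0):
    """Anchor the chosen response's implicit reward against a fixed value
    (zero here), so the chosen-rejected gap cannot be widened simply by
    lowering the chosen response's likelihood."""
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    return -F.logsigmoid(r_chosen - anchor).mean()

# One plausible way to combine the pieces (equal weighting is an
# assumption, not something stated in this summary):
# total_loss = (dpo_loss(...)
#               + image_preference_loss(...)
#               + anchored_preference_loss(...))
```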

Experimental Validation

Experiments are conducted on two multimodal LLMs, Bunny-v1.0-3B and LLaVA-v1.5-7B, evaluated on three benchmarks: MMHalBench, Object HalBench, and AMBER. The results consistently demonstrate that mDPO outperforms standard DPO, especially in reducing hallucination rates across different models and datasets. Numerical results from these experiments underscore the effectiveness of the proposed modifications:

  • On MMHalBench, mDPO applied to Bunny-v1.0-3B improved the overall score from 2.28 (DPO) to 2.96, while reducing the hallucination rate from 0.56 to 0.42.
  • Fine-grained results emphasize the method's superiority on specific question categories like adversarial queries, showcasing its robustness under varied conditions.

Future Implications and Speculations

The methodologies and findings of this paper hold substantial implications for future advancements in AI, particularly in the realm of multimodal learning. By introducing conditional preference mechanisms, models can better exploit multimodal data, leading to more accurate and contextually aware AI systems. The anchoring mechanism could also inspire further research into reward stabilization techniques in reinforcement learning scenarios beyond multimodal applications.

Potential future developments could include:

  • Integration with Other Optimization Techniques: Combining mDPO with other optimization strategies, such as on-policy sampling or hybrid reward models, could yield even better performance.
  • Broader Model Architectures: Extending the principles of mDPO to different architectures and model sizes to assess generalizability.
  • Diverse Modalities: Exploring applications beyond the current text and image modalities, incorporating audio, video, or even multisensory inputs.
  • Real-world Applications: Implementing such optimizations in real-world settings, including virtual assistants, autonomous systems, and multimodal content creation tools where the accuracy and coherence of multimodal inputs are critical.

Conclusion

The proposed mDPO method offers a nuanced and effective solution to the identified challenges in multimodal preference optimization, ensuring a balanced incorporation of visual and textual data. By refining the existing DPO framework with conditional preference optimization and anchored preference optimization, the paper makes a notable contribution to improving the alignment of multimodal LLMs. The empirical results are compelling, and the methodological innovations hold promise for further advancements in AI research and applications.
