Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward (2404.01258v2)
Abstract: Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of LLMs. However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established. This paper introduces a novel framework that uses detailed video captions as a proxy for video content, enabling LLMs to incorporate this information as supporting evidence when scoring video question answering (QA) predictions. Our approach shows robust alignment with the reward mechanism of OpenAI's GPT-4V model, which takes video frames directly as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.
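For context, the caption-informed reward described above is consumed by the standard DPO objective (Rafailov et al.), which fine-tunes the policy directly on preference pairs without a separate learned reward model; the loss below is the generic DPO formulation, not a detail specific to this paper:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big]$$

In this setting, $x$ would be the video QA prompt, $y_w$ and $y_l$ the responses ranked higher and lower by the caption-informed LLM judge, $\pi_{\mathrm{ref}}$ a frozen reference (SFT) video LMM, and $\beta$ a scalar controlling how strongly the updated policy is regularized toward the reference model.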