
Abstract

Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established. This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence for scoring video question-answering (QA) predictions. Our approach demonstrates robust alignment with the OpenAI GPT-4V model's reward mechanism, which directly takes video frames as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.

Workflow showcases GPT-4V enhancing video dataset captions, generating instructional data, and boosting model performance.

Overview

  • A novel framework leveraging Direct Preference Optimization (DPO) to enhance video large multimodal models (video LMMs) on Video Question Answering tasks, using detailed video captions as evidence for assessing the factual accuracy of model responses.

  • Introduction of the ShareGPTVideo dataset, containing 900k detailed video captions, to support the training and evaluation of video LMMs through a three-stage training pipeline.

  • The proposed framework significantly improves video LMMs' performance on video QA tasks, with the LLaVA-Hound-DPO model showing an 8.1% accuracy improvement over its SFT counterpart.

  • The research offers a scalable and cost-effective reward mechanism for video content understanding and opens new directions for multimodal model training and evaluation.

Enhancing Video Large Multimodal Models with Direct Preference Optimization from Language Model Rewards

Introduction

Researchers have developed a novel framework that leverages Direct Preference Optimization (DPO) to substantially improve the performance of video large multimodal models (video LMMs) on Video Question Answering (Video QA) tasks. The work introduces a reward mechanism that uses detailed video captions as a proxy for video content, enabling language models to assess the factual accuracy of responses generated by video LMMs more effectively.
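
To make the mechanism concrete, the sketch below shows one way a language model could score a video QA prediction using only the detailed caption as evidence. This is an illustrative reconstruction, not the paper's exact prompt or scale: `query_llm` is a hypothetical stand-in for any chat-completion API, and the 1-5 rating rubric is an assumption.

```python
# Caption-as-proxy reward scoring (illustrative sketch).
# `query_llm` is a hypothetical callable: prompt string -> reply string.
import re

def build_judge_prompt(caption: str, question: str, prediction: str) -> str:
    """Compose an evaluation prompt that uses the detailed caption as evidence."""
    return (
        "You are grading a video question-answering response. The detailed "
        "caption below faithfully describes the video; treat it as the only "
        "available evidence about the video content.\n\n"
        f"Video caption: {caption}\n"
        f"Question: {question}\n"
        f"Model answer: {prediction}\n\n"
        "Rate the factual alignment of the answer with the caption from 1 "
        "(hallucinated) to 5 (fully supported). Reply with the number only."
    )

def score_prediction(caption: str, question: str, prediction: str, query_llm) -> int:
    """Return an integer reward in [1, 5] parsed from the judge model's reply."""
    reply = query_llm(build_judge_prompt(caption, question, prediction))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # treat unparseable replies as the lowest score
```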

The Challenge

In the face of escalating demand for video content understanding, enhancing the capability of video LMMs to accurately follow video instructions has emerged as a significant challenge. Traditional Reinforcement Learning (RL) and DPO approaches, while effective in text-based domains, have struggled with multimodal contexts, such as video, primarily due to difficulties in developing robust reward systems. Addressing the challenges of costly human preference data collection and scalability issues with reinforcement learning models, the paper proposes a new approach that leverages video captions to improve model alignment and performance in video-based tasks.

Dataset and Methodology

To address the challenges in evaluating video LMMs, the researchers devised a comprehensive dataset named ShareGPTVideo. The dataset contains 900k detailed video captions, capturing a wide range of video content elements such as temporal dynamics and spatial relationships. These captions serve as a foundation for the proposed reward mechanism by providing a rich source of information for language models to assess the factual alignment of video LMM responses.
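
One plausible way to turn these caption-grounded assessments into preference data for the DPO stage described below is sketched here: sample several candidate answers per question, score each against the caption (for example with a judge like the one sketched above), and keep the highest- and lowest-scored answers as the chosen and rejected responses. The paper's exact sampling and filtering rules may differ; this is an assumption for illustration only.

```python
# Illustrative construction of a (chosen, rejected) pair from caption-grounded
# scores; the paper's exact sampling/filtering procedure may differ.
from typing import Callable, List, Optional

def build_preference_pair(caption: str, question: str, candidates: List[str],
                          score_fn: Callable[[str, str, str], int]) -> Optional[dict]:
    """Score each candidate answer against the caption and return the best and
    worst answers as a DPO preference record, or None if all scores tie."""
    scored = sorted((score_fn(caption, question, ans), ans) for ans in candidates)
    (low, rejected), (high, chosen) = scored[0], scored[-1]
    if high == low:
        return None  # no usable preference signal for this question
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```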

The paper outlines a three-stage training pipeline for the proposed framework:

  1. Caption Pre-training Stage: Utilizes the newly introduced video caption data for pre-training, enriching the model's understanding of video content.
  2. Supervised Fine-Tuning (SFT) Stage: Involves fine-tuning with video instruction-following data generated from the detailed video captions, ensuring the model's responses are grounded in the video content.
  3. Direct Preference Optimization (DPO) Stage: Applies the DPO algorithm to refine the model's responses further, using rewards derived from a language model's assessment of the responses' factual alignment (a minimal sketch of the DPO objective follows this list).
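
For reference, the following is a minimal PyTorch sketch of the standard DPO objective applied in stage 3. It assumes per-response summed token log-probabilities have already been computed under the trained policy and the frozen reference (SFT) model; `beta` is the usual KL-tradeoff hyperparameter, and 0.1 is an illustrative default rather than the paper's setting.

```python
# Minimal sketch of the standard DPO loss (not the authors' exact implementation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (policy log-ratio - reference log-ratio)), averaged
    over a batch of (chosen, rejected) response pairs. Each argument is a 1-D
    tensor of summed token log-probabilities."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

# Dummy log-probabilities for a batch of two preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
```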

Experimental Results

The experimental evaluation demonstrates the effectiveness of the proposed framework in enhancing video LMMs' performance on video QA tasks. Notably, the LLaVA-Hound-DPO model, which incorporates the DPO training stage, achieved an 8.1% improvement in accuracy over its SFT counterpart. This significant performance enhancement illustrates the value of utilizing video captions as proxies for video content in the DPO process.

Implications and Future Work

This research represents a significant advancement in the alignment and performance of video LMMs on video QA tasks. The introduction of a cost-effective and scalable reward mechanism using detailed video captions as proxies offers a promising direction for future work in multimodal model training and evaluation. The work also opens up new possibilities for exploring other domains where video content understanding is critical. Future research might include expanding the dataset to cover a broader range of video types and exploring other model architectures to further improve performance and alignment in video-based tasks.

Conclusion

In conclusion, this paper presents a novel approach to improving video LMMs through a detailed video caption dataset and a tailored DPO method. The proposed framework not only enhances model performance on video QA tasks but also addresses the scalability challenges associated with training multimodal models. This work lays a solid foundation for further research in video content understanding and model alignment, marking a notable contribution to the field of AI and multimodal learning.
