Distilling Vision-Language Models on Millions of Videos (2401.06129v2)
Abstract: The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model, obtained by video-instruction-tuning (VIIT), is then used to auto-label millions of videos and generate high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks; for instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Moreover, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a by-product, we generate the largest video caption dataset to date.
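The second stage the abstract describes, contrastively training a video-language dual encoder on auto-generated captions, is typically built on a symmetric InfoNCE objective (cf. the contrastive predictive coding entry in the references below). The following is a minimal sketch of that objective, not the paper's exact recipe: the random tensors stand in for hypothetical video and text towers, and the fixed temperature of 0.07 is a common initialization rather than a reported setting.

```python
# Minimal sketch of the symmetric InfoNCE loss for a video-text dual encoder
# trained on paired (video, auto-generated caption) batches. The embeddings
# here are random stand-ins for encoder outputs; the temperature is an
# assumed value, not the paper's configuration.
import torch
import torch.nn.functional as F

def infonce_loss(video_emb: torch.Tensor,
                 text_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched video-text pairs."""
    # L2-normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs on diagonal
    # Average the video-to-text and text-to-video cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: 8 pairs of 512-d embeddings from stand-in encoders.
loss = infonce_loss(torch.randn(8, 512), torch.randn(8, 512))
print(f"contrastive loss: {loss.item():.4f}")
```

In practice the temperature is often a learned parameter and batches are gathered across devices, but the diagonal-target structure above is the essence of dual-encoder contrastive training.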
- Flamingo: A visual language model for few-shot learning. In NeurIPS, 2022.
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
- MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
- Language models are few-shot learners. In NeurIPS, 2020.
- Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- A short note about Kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
- ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
- PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023a.
- PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023b.
- PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2023c.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.
- EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations. In NeurIPS D&B, 2022.
- Next-generation deep learning based on simulators and synthetic data. Trends in Cognitive Sciences, 2022.
- FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Omni-sourced webly-supervised learning for video recognition. In ECCV, 2020.
- Oops! Predicting unintentional action in video. In CVPR, 2020.
- Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
- Google. PaLM 2 technical report, 2023.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
- Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- Temporal alignment networks for long-term video. In CVPR, 2022.
- Deep residual learning for image recognition. In CVPR, 2016.
- The curious case of neural text degeneration. In ICLR, 2020.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Segment anything. In ICCV, 2023.
- Dense-captioning events in videos. In ICCV, 2017.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018.
- The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.
- Fast inference from transformers via speculative decoding. In ICML, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023a.
- VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023b.
- DeCap: Decoding CLIP latents for zero-shot captioning via text-only training. In ICLR, 2023c.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- Visual instruction tuning. In NeurIPS, 2023.
- CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning. Neurocomputing, 2022.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- Generating training data with language models: Towards zero-shot language understanding. In NeurIPS, 2022.
- HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
- Spoken moments: Learning joint audio-visual representations from video descriptions. In CVPR, 2021.
- Learning audio-video modalities from image captions. In ECCV, 2022.
- Improving multimodal datasets with image captioning. In NeurIPS D&B, 2023.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- OpenAI. GPT-4V(ision) system card, 2023.
- Occluded video instance segmentation: A benchmark. IJCV, 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 2019.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
- Playing for data: Ground truth from computer games. In ECCV, 2016.
- LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS D&B, 2022.
- Neural machine translation of rare words with subword units. In ACL, 2016.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
- UL2: Unifying language learning paradigms. In ICLR, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Connecting vision and language with video localized narratives. In CVPR, 2023.
- Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, 2021.
- Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In CVPR, 2023a.
- VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019.
- InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023b.
- Self-Instruct: Aligning language models with self-generated instructions. In ACL, 2023c.
- Finetuned language models are zero-shot learners. In ICLR, 2022.
- NExT-QA: Next phase of question-answering to explaining temporal actions. In CVPR, 2021.
- Video question answering via gradually refined attention over appearance and motion. In ACM MM, 2017.
- VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, 2021.
- MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
- Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979, 2022.
- Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
- AIM: Adapting image models for efficient video action recognition. In ICLR, 2023a.
- Mitigating spurious correlations in multi-modal models during fine-tuning. In ICML, 2023b.
- CoCa: Contrastive captioners are image-text foundation models. TMLR, 2022.
- ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- MERLOT Reserve: Neural script knowledge through vision and language and sound. In CVPR, 2022.
- Scaling vision transformers. In CVPR, 2022a.
- LiT: Zero-shot transfer with locked-image text tuning. In CVPR, 2022b.
- Training a large video model on a single machine in a day. arXiv preprint arXiv:2309.16669, 2023.
- Learning video representations from large language models. In CVPR, 2023.
- Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.