Distilling Vision-Language Models on Millions of Videos

(2401.06129)
Published Jan 11, 2024 in cs.CV

Abstract

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date.

Overview

  • The paper introduces a method for adapting image-based vision-language models to videos using high-quality pseudo-captions generated from web videos.

  • A two-step fine-tuning process first adapts the visual encoder with video captions and then tunes the language model with instruction-following data.

  • Pseudo-captioning provides scalable, detailed, and relevant textual supervision that captures video temporal dynamics, enhancing model training.

  • The adapted model shows significant improvements on video-language benchmarks, with performance scaling positively with data amount.

  • This breakthrough could lead to more sophisticated multimodal AI systems capable of understanding video content on a large scale.

Overview of the Paper

The paper presents an approach for adapting image-based vision-language models (VLMs) to videos. To address the scarcity of human-labeled video data, the authors generate high-quality pseudo-captions for millions of web-scraped videos: a VLM is first fine-tuned on video-captioning data, and the adapted model then auto-generates video descriptions that are used to train a video-language dual-encoder model. This dual-encoder model achieves state-of-the-art performance on benchmarks such as MSR-VTT zero-shot text-to-video retrieval.

Methodology

The adaptation proceeds in two stages. First, the visual encoder of the VLM is fine-tuned on video captions so that it captures scene dynamics rather than only static appearance; the language model is kept frozen to avoid degradation from the short, repetitive patterns common in video-text data. Second, the language model is tuned on instruction-following data (questions and answers generated by another language model) while the visual encoder remains frozen.
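To make this freeze-and-tune schedule concrete, here is a minimal PyTorch sketch. The encoder and language-model modules are toy stand-ins, and the data, losses, and hyperparameters are illustrative assumptions rather than the paper's actual setup.

```python
# Minimal sketch of the two-stage adaptation schedule, using toy stand-in
# modules; the real model components, data, and losses differ from the paper.
import torch
import torch.nn as nn

# Hypothetical stand-ins for the VLM's visual encoder and language model.
visual_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
language_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1000))

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def train_step(features: torch.Tensor, targets: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """One gradient step on whichever parameters are currently trainable."""
    logits = language_model(visual_encoder(features))
    loss = nn.functional.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

features = torch.randn(8, 512)           # stand-in for pooled video features
targets = torch.randint(0, 1000, (8,))   # stand-in for caption/answer targets

# Stage 1: adapt the visual encoder on video-caption data; keep the LM frozen.
set_trainable(visual_encoder, True)
set_trainable(language_model, False)
stage1_opt = torch.optim.AdamW(visual_encoder.parameters(), lr=1e-4)
train_step(features, targets, stage1_opt)

# Stage 2: tune the language model on instruction-following data; freeze the encoder.
set_trainable(visual_encoder, False)
set_trainable(language_model, True)
stage2_opt = torch.optim.AdamW(language_model.parameters(), lr=1e-5)
train_step(features, targets, stage2_opt)
```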

Using instruction-following data that emphasizes causal and temporal reasoning makes the approach more robust: it strengthens the model's reasoning about events over time and adds diversity and detail to the generated video pseudo-captions.
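As an illustration of what such instruction-following data might look like, the snippet below builds a prompt that asks a separate language model to turn an existing caption into causal and temporal question-answer pairs. The template wording is a hypothetical sketch, not the authors' actual prompt.

```python
# Illustrative prompt construction for synthesizing causal/temporal QA pairs.
# The template text conveys the general idea only, not the paper's prompt.
QA_PROMPT = (
    "You are given a description of a video:\n"
    "{caption}\n\n"
    "Write three question-answer pairs that require causal or temporal "
    "reasoning about the events (e.g. what happens before or after an "
    "action, or why an action is performed). Format each pair as:\n"
    "Q: ...\nA: ..."
)

def build_instruction_prompt(caption: str) -> str:
    """Fill the template with a video caption before sending it to an LLM."""
    return QA_PROMPT.format(caption=caption)

print(build_instruction_prompt("A person slices vegetables, then stirs them into a hot pan."))
```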

Benefits of Pseudo-Captions

Pseudo-captions offer several advantages. They are grounded in the actual video content and capture the temporal dynamics that image-based captions miss. Because the adapted model can caption many videos automatically and in parallel, the annotation process scales to millions of clips. The resulting descriptions are also more detailed than those produced by existing methods, providing stronger textual supervision.
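Conceptually, the auto-labeling loop can be pictured as below; `caption_fn` is a hypothetical wrapper around the adapted video-language model, and sampling several captions per clip is one assumed way to add diversity to the supervision.

```python
# Sketch of scalable pseudo-captioning; caption_fn is a hypothetical wrapper
# around the adapted video-language model (frames in, one sampled caption out).
from typing import Callable, Iterable, List

def pseudo_caption_videos(
    videos: Iterable,                        # decoded frames, one entry per clip
    caption_fn: Callable[[object], str],     # adapted VLM as a frames -> caption function
    captions_per_video: int = 4,             # several samples per clip for diversity
) -> List[List[str]]:
    """Auto-label each video with several sampled pseudo-captions."""
    labeled = []
    for frames in videos:
        labeled.append([caption_fn(frames) for _ in range(captions_per_video)])
    return labeled

# Dummy caption_fn purely for illustration; a real run would call the adapted VLM.
demo = pseudo_caption_videos(range(2), caption_fn=lambda f: "a person cooks in a kitchen")
```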

Evaluating the Adapted Model

The adapted model was evaluated on a range of video-language benchmarks and improved across the board. When the pseudo-captions were used to pre-train a dual-encoder model, performance scaled with the amount of pseudo-captioned data. Under contrastive pre-training, models trained on pseudo-captions significantly outperformed those trained on the original dataset captions for both text-to-video retrieval and video classification.
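A minimal sketch of such a contrastive pre-training objective is shown below, with random embeddings standing in for the dual encoder's actual video and text towers; the loss form and temperature are standard assumptions rather than the paper's exact configuration.

```python
# CLIP-style symmetric contrastive loss over (video, pseudo-caption) pairs.
# The embeddings below are random stand-ins for real encoder outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; matched pairs sit on the diagonal of the similarity matrix."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # [B, B] cosine similarities
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

video_emb = torch.randn(16, 256, requires_grad=True)  # stand-in video-tower outputs
text_emb = torch.randn(16, 256, requires_grad=True)   # stand-in text-tower outputs
loss = contrastive_loss(video_emb, text_emb)
loss.backward()
```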

Summary and Impact

The technique developed in this study marks clear progress in video-language understanding, reflected in notable gains on zero-shot video retrieval and classification tasks. Given how scarce human-curated video-text data is, this approach paves the way for more nuanced and sophisticated multimodal AI systems that can effectively analyze and understand video content at scale.
