
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

(2304.08345)
Published Apr 17, 2023 in cs.LG, cs.CL, cs.CV, cs.MM, and eess.AS

Abstract

In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Different from widely studied vision-language pretraining models, VALOR jointly models relationships among vision, audio, and language in an end-to-end manner. It contains three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, building vision-language, audio-language, and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains 1M audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and generalize to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available at the project page: https://casia-iva-group.github.io/projects/VALOR.

VALOR's pretraining framework pairs three single-modality encoders with a multimodal decoder for conditional text generation and broad cross-task generalization.

Overview

  • The paper introduces the VALOR model, aimed at learning from vision, audio, and language modalities for multimodal understanding tasks.

  • VALOR includes tri-modality encoders and a multimodal decoder that enable cross-modality interpretation and text generation based on visual and/or auditory inputs.

  • It presents the VALOR-1M pretraining dataset and the VALOR-32K benchmark to support research on, and evaluation of, vision-audio-language pretraining.

  • Experimental results show VALOR outperforms existing models in tasks like text-video retrieval and video question answering, demonstrating its ability to learn strong multimodal correlations.


Introduction

The progression of multimedia understanding tasks, particularly those requiring comprehension across multiple modalities such as vision, audio, and language, necessitates models capable of intricate cross-modality interpretation and generation. Recent endeavors predominantly focus on vision-language pretraining (VLP), neglecting the rich semantic nuances audio can provide. This gap motivates the introduction of a novel Vision-Audio-Language Omni-peRception (VALOR) pretraining model designed for tri-modality learning, offering an end-to-end framework for learning multimodal correlations and producing text conditioned on visual and/or auditory inputs.

VALOR Model and Pretraining Tasks

VALOR combines three single-modality encoders for vision, audio, and language with a multimodal decoder dedicated to conditional text generation. The model is pretrained with two tasks: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, audio, and language into a shared space and establishes alignment across modality groups (vision-language, audio-language, audiovisual-language) through contrastive learning. MGC randomly masks text tokens and requires the model to reconstruct them conditioned on vision, audio, or both, fostering generative capability across modalities. During pretraining, the conditioning modality group is sampled so that the model learns to handle each combination.
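The sketch below illustrates how these two objectives could be combined in a single training step. It is a minimal, hypothetical PyTorch-style implementation, not the authors' released code: the encoder and decoder modules, the `mask_tokens` helper, the masking ratio, and the temperature are all placeholder assumptions.

```python
# Minimal sketch of VALOR-style Multimodal Grouping Alignment (MGA) and
# Multimodal Grouping Captioning (MGC). Module and helper names
# (vision_enc, audio_enc, text_enc, text_dec, mask_tokens) are placeholders.
import random
import torch
import torch.nn.functional as F


def mask_tokens(ids, mask_prob=0.6, mask_id=103, ignore_index=-100):
    """Randomly replace tokens with [MASK]; loss is computed on masked positions only."""
    mask = torch.rand_like(ids, dtype=torch.float) < mask_prob
    labels = ids.masked_fill(~mask, ignore_index)   # keep targets only where masked
    masked_ids = ids.masked_fill(mask, mask_id)
    return masked_ids, labels


def mga_loss(text_emb, group_emb, temperature=0.07):
    """Symmetric InfoNCE between pooled text and pooled modality-group embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    group_emb = F.normalize(group_emb, dim=-1)
    logits = text_emb @ group_emb.t() / temperature              # [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def training_step(batch, vision_enc, audio_enc, text_enc, text_dec):
    # Encode each modality with its own encoder.
    v = vision_enc(batch["video"])        # [B, Nv, D]
    a = audio_enc(batch["audio"])         # [B, Na, D]
    t = text_enc(batch["caption_ids"])    # [B, Nt, D]

    # Sample a modality group: vision, audio, or audio+vision.
    group = random.choice(["v", "a", "av"])
    feats = {"v": v, "a": a, "av": torch.cat([v, a], dim=1)}[group]

    # MGA: contrastive alignment between text and the sampled group.
    loss_mga = mga_loss(t.mean(dim=1), feats.mean(dim=1))

    # MGC: reconstruct masked caption tokens with a decoder that
    # cross-attends to the grouped modality features.
    masked_ids, labels = mask_tokens(batch["caption_ids"])
    logits = text_dec(masked_ids, cross_attn_inputs=feats)       # [B, Nt, vocab]
    loss_mgc = F.cross_entropy(logits.flatten(0, 1), labels.flatten(),
                               ignore_index=-100)

    return loss_mga + loss_mgc
```

In this sketch, MGA is a symmetric InfoNCE loss over mean-pooled embeddings and MGC is a masked-token cross-entropy; the actual model may pool, fuse, and decode differently.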

VALOR-1M and VALOR-32K Datasets

To facilitate research in vision-audio-language pretraining, a large-scale dataset named VALOR-1M is constructed, consisting of 1M audible videos with manually annotated audiovisual captions that describe both what is seen and what is heard. Alongside it, the VALOR-32K benchmark is introduced for evaluating audiovisual-language capabilities, including the new tasks of audiovisual retrieval (AVR) and audiovisual captioning (AVC). Together, these datasets provide a robust foundation for examining the cross-modality learning efficacy of models like VALOR.
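For concreteness, a single annotation in a VALOR-1M-style corpus might look like the record below; the field names and values are illustrative assumptions, not the released schema.

```python
# Hypothetical VALOR-1M-style annotation record: one human-written caption
# per audible video, covering both visual and auditory content.
sample = {
    "video_id": "example_000001",          # placeholder identifier
    "duration_s": 10.0,                    # clip length in seconds
    "audiovisual_caption": (
        "A man plays an acoustic guitar on a street corner "
        "while cars pass by and a dog barks in the background."
    ),
}
```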

Experimental Results

Extensive experiments demonstrate VALOR's capability to learn strong multimodal correlations and generalize across cross-modality tasks, including retrieval, captioning, and question answering, with diverse input modalities. Notably, VALOR surpasses previous best-performing models by clear margins on public benchmarks such as text-video retrieval and video question answering.

Conclusion and Future Work

The VALOR model, accompanied by the VALOR-1M and VALOR-32K datasets, sets a new standard for vision-audio-language pretraining research. The proposed model and datasets address the critical need for integrating audio modality into multimodal understanding and generation, underscoring the importance of audio-visual-textual alignment for comprehensive multimedia analysis. Future work may explore extending the VALOR framework to cover more diverse modalities and complex tasks, potentially incorporating unsupervised methods for expanding the VALOR-1M dataset or integrating generative modeling for vision and audio modalities.
