Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training (1908.06066v3)

Published 16 Aug 2019 in cs.CV

Abstract: We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models such as XLM and Unicoder, both visual and linguistic contents are fed into a multi-layer Transformer for cross-modal pre-training, where three pre-training tasks are employed: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks, demonstrating the power of cross-modal pre-training.

Citations (851)

Summary

  • The paper introduces Unicoder-VL, a multi-layer Transformer model that jointly encodes visual and textual data for robust cross-modal representations.
  • It employs three pre-training tasks—masked language modeling, masked object classification, and visual-linguistic matching—to learn context-aware multi-modal embeddings.
  • Evaluated on image-text retrieval and VCR tasks, Unicoder-VL achieves competitive results, demonstrating strong generalization and reasoning capabilities.

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Overview

The paper "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training" proposes an innovative model designed to produce joint representations of vision and language. The proposed model, Unicoder-VL, leverages a multi-layer Transformer architecture to pre-train on large-scale image-caption datasets, thus learning contextualized embeddings across modalities. The model is evaluated on various downstream tasks, exhibiting state-of-the-art or competitive results, highlighting the efficacy of the cross-modal pre-training approach.

Key Contributions

  1. Unified Model Architecture: Unicoder-VL utilizes a multi-layer Transformer to jointly encode visual and linguistic data, which is crucial for tasks needing comprehensive understanding of both modalities.
  2. Effective Pre-training Tasks: The model is pre-trained with three tasks (a minimal sketch of the joint encoder and these objectives appears after this list):
    • Masked Language Modeling (MLM): Inspired by BERT, it predicts masked words from the surrounding linguistic and visual context.
    • Masked Object Classification (MOC): It predicts the object category of image regions whose features have been masked out.
    • Visual-Linguistic Matching (VLM): It learns to determine whether a given image and text description are semantically aligned.
  3. Large-Scale Pre-training Data: Pre-training on approximately 3.8 million image-caption pairs from the Conceptual Captions and SBU Captions datasets substantially strengthens the model's ability to learn robust cross-modal representations.
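
The following sketch illustrates, under assumptions, how such a joint encoder and its three pre-training heads could be wired together in PyTorch. The class name JointEncoder, the layer sizes, the vocabulary size, and the number of object classes are illustrative placeholders rather than the paper's exact configuration, and the real model additionally uses positional and region-location embeddings that are omitted here for brevity.

```python
# Minimal, hypothetical sketch of a Unicoder-VL-style joint encoder with its
# three pre-training heads (MLM, MOC, VLM). Dimensions are placeholders.
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    def __init__(self, vocab_size=30522, num_obj_classes=1600,
                 hidden=768, layers=12, heads=12, region_feat_dim=2048):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, hidden)
        # Project detected-region features (e.g. from an object detector)
        # into the same space as the word embeddings.
        self.img_proj = nn.Linear(region_feat_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # Task heads: MLM over the word vocabulary, MOC over object labels,
        # VLM as a binary match/mismatch classifier on the first token.
        self.mlm_head = nn.Linear(hidden, vocab_size)
        self.moc_head = nn.Linear(hidden, num_obj_classes)
        self.vlm_head = nn.Linear(hidden, 2)

    def forward(self, token_ids, region_feats):
        txt = self.txt_embed(token_ids)          # (B, T, H) text embeddings
        img = self.img_proj(region_feats)        # (B, R, H) region embeddings
        joint = torch.cat([txt, img], dim=1)     # one cross-modal sequence
        ctx = self.encoder(joint)                # context-aware representations
        T = token_ids.size(1)
        return {
            "mlm_logits": self.mlm_head(ctx[:, :T]),  # predict masked words
            "moc_logits": self.moc_head(ctx[:, T:]),  # predict masked regions
            "vlm_logits": self.vlm_head(ctx[:, 0]),   # image-text match score
        }

# Toy forward pass: 2 captions of 16 tokens paired with 36 region features each.
model = JointEncoder()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print({k: v.shape for k, v in out.items()})
```

During pre-training, the MLM and MOC outputs would be supervised only at the masked positions, while the VLM head would be trained on both aligned and randomly mismatched image-caption pairs.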

Experimental Results

The model's performance is validated on multiple visual-linguistic tasks, notably:

  1. Image-Text Retrieval: When fine-tuned on MSCOCO and Flickr30k, Unicoder-VL demonstrated superior performance across sentence and image retrieval metrics. Specifically, on MSCOCO, it achieved an R@1 score of 84.3% and 69.7% for sentence and image retrieval, respectively.
  2. Zero-shot Image-Text Retrieval: Evaluated without task-specific fine-tuning, Unicoder-VL still produced robust results, highlighting its generalization capabilities (see the ranking sketch after this list).
  3. Visual Commonsense Reasoning (VCR): When fine-tuned for VCR, Unicoder-VL performed comparably to or better than state-of-the-art models such as ViLBERT and VisualBERT, indicating strong cross-modal reasoning capabilities.
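
At retrieval time, a matching head of this kind can score every candidate image-text pair and rank candidates by the predicted match probability. The sketch below shows this idea, assuming the hypothetical JointEncoder from the earlier snippet; the function name rank_captions and the tensor shapes are illustrative, and a real evaluation would score far larger candidate pools in batches.

```python
# Illustrative retrieval ranking with the matching (VLM) head of the
# JointEncoder sketch above. Names and shapes are hypothetical.
import torch

@torch.no_grad()
def rank_captions(model, region_feats, candidate_token_ids):
    """region_feats: (R, D) regions of one image; candidate_token_ids: (N, T)."""
    n = candidate_token_ids.size(0)
    # Pair the single image with every candidate caption.
    imgs = region_feats.unsqueeze(0).expand(n, -1, -1)
    logits = model(candidate_token_ids, imgs)["vlm_logits"]  # (N, 2)
    match_prob = logits.softmax(dim=-1)[:, 1]                # P(caption matches image)
    return match_prob.argsort(descending=True)               # best candidate first

# Rank 5 toy candidate captions for one image (reusing `model` from above).
order = rank_captions(model, torch.randn(36, 2048), torch.randint(0, 30522, (5, 16)))
print(order)
```

In the zero-shot setting such a head would be used directly after pre-training, whereas the fine-tuned retrieval results further train it on the target dataset.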

Discussion

Several observations emerge from the analysis:

  • Model Size: Performance improved as the number of Transformer layers increased; a 24-layer model notably outperformed the 6-layer and 12-layer configurations.
  • Pre-training Dataset Size: Scaling the dataset size from 3M to 3.8M image-caption pairs consistently improved retrieval performance, demonstrating the importance of extensive, diverse datasets in pre-training.
  • Comparison with Concurrent Works: Unicoder-VL’s architecture and pre-training tasks render it competitive against contemporary models such as UNITER and ViLBERT. Despite using fewer high-quality, in-domain datasets for pre-training compared to UNITER, Unicoder-VL performed admirably, indicating its robustness.

Future Directions

Future research could expand on several promising avenues:

  1. Enhancing Pre-training Tasks: Investigate additional pre-training tasks that further align visual and linguistic understanding, potentially incorporating image-only inputs effectively.
  2. Extending to Other Modalities: Explore applicability in video-related tasks such as video captioning and video-based question answering.
  3. Fusion with Detection Models: Integrate fine-tuning of the underlying detection models alongside the cross-modal pre-training to potentially boost performance further.
  4. Expanding Datasets: Incorporate a broader and more diverse range of high-quality image-caption datasets to enrich the pre-training process.

The research presented in "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training" underscores significant strides in the domain of multi-modal AI, paving the way for future advancements in integrated vision and language processing tasks. The model's architecture, combined with its pre-training methodology, promises versatile applications and a robust foundation for further explorations.