mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (2205.12005v2)
Abstract: Large-scale pretrained foundation models have become an emerging paradigm for building AI systems that can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from low computational efficiency and information asymmetry caused by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which create inter-layer shortcuts that skip a certain number of layers of time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding, and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.
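The skip-connection pattern described in the abstract can be illustrated with a short sketch. The following is a minimal PyTorch-style illustration, not the authors' implementation: the module names, layer count `s`, and the exact residual/normalization layout are assumptions. It shows the core idea of several asymmetric co-attention layers in which only the short text stream is updated while the long visual sequence is carried forward unchanged, followed by one "connected" layer that runs full self-attention over the concatenation of both streams.

```python
# Minimal sketch (assumptions: PyTorch, hypothetical module names) of the
# cross-modal skip-connection pattern described in the abstract.
import torch
import torch.nn as nn


class AsymmetricCoAttention(nn.Module):
    """Updates only the text stream; the long visual sequence skips its
    own (expensive) self-attention and is passed through unchanged."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text, image):
        # Text queries attend over the concatenated [text; image] sequence.
        kv = torch.cat([text, image], dim=1)
        attn_out, _ = self.cross_attn(text, kv, kv)
        text = self.norm1(text + attn_out)
        text = self.norm2(text + self.ffn(text))
        return text, image  # image is returned untouched: the skip


class ConnectedSkipBlock(nn.Module):
    """s asymmetric co-attention layers followed by one full self-attention
    layer over both streams, where the skipped visual features re-join."""

    def __init__(self, dim: int = 768, heads: int = 12, s: int = 2):
        super().__init__()
        self.co_attn = nn.ModuleList(
            [AsymmetricCoAttention(dim, heads) for _ in range(s)]
        )
        self.full_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, image):
        for layer in self.co_attn:
            text, image = layer(text, image)
        fused = torch.cat([text, image], dim=1)
        attn_out, _ = self.full_attn(fused, fused, fused)
        fused = self.norm(fused + attn_out)
        # Split back into the two streams for the next block.
        return fused[:, : text.size(1)], fused[:, text.size(1):]


# Toy usage: a batch of 2 captions (16 tokens) and 2 images (577 patch tokens).
text = torch.randn(2, 16, 768)
image = torch.randn(2, 577, 768)
text_out, image_out = ConnectedSkipBlock()(text, image)
```

Because the long visual sequence only enters full self-attention once per block rather than at every layer, the quadratic cost on the vision side is paid far less often, which is the efficiency argument the abstract makes.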