
Abstract

In the evolution of vision-language pre-training, shifting from short-text comprehension to extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, which leverage the long-context capability of LLMs, have excelled at few-shot text generation but struggle with alignment tasks. Addressing this gap, we introduce a contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (COSMO), which strategically partitions the language model into a dedicated unimodal text-processing component and a multimodal data-handling component. COSMO, our unified framework, merges unimodal and multimodal elements, improving performance on tasks involving textual and visual data while notably reducing the number of learnable parameters. Such models, however, demand extensive long-text datasets, and high-quality long-text video datasets remain scarce. To bridge this gap, this work introduces Howto-Interlink7M, an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how Howto-Interlink7M enhances model performance even on image-text tasks. With 34% of the learnable parameters and 72% of the available data, our model demonstrates significant superiority over OpenFlamingo; for instance, in the 4-shot Flickr captioning task, performance notably improves from 57.2% to 65.1%. The contributions of COSMO and Howto-Interlink7M are underscored by notable performance gains across 14 diverse downstream datasets spanning both image-text and video-text tasks.

Overview

  • Introduces COSMO, a model that integrates contrastive loss with text generation for improved multimodal data processing.

  • COSMO partitions the language model into two segments, one for text and one for multimodal information, reducing learnable parameters while improving task performance.

  • Presents the Howto-Interlink7M dataset, whose high-quality captions for instructional videos aid multimodal model training.

  • COSMO outperforms similar models like OpenFlamingo in various tasks, despite using fewer learnable parameters and fewer training samples.

  • The paper's findings indicate potential for AI advancements in long-text multimodal applications and future research.

Introduction

AI research has been progressively advancing towards models that can understand and process not only text but also multimodal information—data that combines text with other forms such as images or videos. This development has come with its own set of challenges, especially in aligning and processing such multimodal data effectively.

COSMO Framework

Addressing these challenges, the paper introduces COSMO, a COntrastive-Streamlined MultimOdal model that integrates a contrastive loss with text generation. COSMO stands out by dividing the large language model into two segments: one dedicated to processing text and the other to fusing multimodal information. This split lets COSMO handle both unimodal and multimodal tasks efficiently, with a substantial reduction in learnable parameters and performance improvements across 14 downstream datasets spanning image, text, and video data.
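To make the split concrete, the sketch below shows one way the idea could be wired up in PyTorch: lower transformer layers handle text alone, upper layers cross-attend to visual features, and training sums an autoregressive language-modeling loss with a symmetric contrastive loss. This is only an illustrative sketch; the class CosmoSketch, the layer counts, the pooling choices, and the combined_loss helper are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosmoSketch(nn.Module):
    """Toy split of an LLM into unimodal (text-only) and multimodal halves."""

    def __init__(self, vocab=32000, dim=512, n_uni=4, n_multi=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Lower half: ordinary self-attention layers that see only text.
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(n_uni)]
        )
        # Upper half: decoder layers that also cross-attend to visual features.
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True) for _ in range(n_multi)]
        )
        self.lm_head = nn.Linear(dim, vocab)
        # Projection heads for the image-text contrastive objective.
        self.txt_proj = nn.Linear(dim, dim)
        self.img_proj = nn.Linear(dim, dim)

    def forward(self, text_ids, vision_feats):
        h = self.embed(text_ids)
        for layer in self.unimodal:          # text-only processing
            h = layer(h)
        # Pooled embeddings for the contrastive loss (placement is an assumption).
        txt_emb = F.normalize(self.txt_proj(h.mean(dim=1)), dim=-1)
        img_emb = F.normalize(self.img_proj(vision_feats.mean(dim=1)), dim=-1)
        for layer in self.multimodal:        # fuse visual context via cross-attention
            h = layer(h, vision_feats)
        return self.lm_head(h), txt_emb, img_emb


def combined_loss(logits, labels, txt_emb, img_emb, temperature=0.07):
    """Autoregressive captioning loss plus a symmetric InfoNCE contrastive loss."""
    # Labels would normally be shifted by one token for next-token prediction;
    # omitted here for brevity.
    lm = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    sims = txt_emb @ img_emb.t() / temperature
    targets = torch.arange(sims.size(0))
    itc = 0.5 * (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets))
    return lm + itc


# Toy usage with random data standing in for tokenized text and ViT patch features.
model = CosmoSketch()
ids = torch.randint(0, 32000, (2, 16))
vision = torch.randn(2, 49, 512)
logits, t, v = model(ids, vision)
print(combined_loss(logits, ids, t, v).item())
```

Keeping the contrastive embeddings on the unimodal half and the generation head on the multimodal half mirrors the division of labor described above, though the paper's exact placement of each objective may differ.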

Howto-Interlink7M Dataset

A key hurdle for training such models is the lack of high-quality long-text multimodal datasets. The paper addresses this by introducing a novel interleaved video-text dataset called Howto-Interlink7M. Built by annotating segments of instructional videos, the dataset stands out for its high-quality, detailed captions that preserve narrative coherence across video clips. Training on Howto-Interlink7M further improves COSMO's performance across various image-text and video-text tasks.
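As a rough illustration of what an interleaved video-text sample might look like, the sketch below stores per-clip captions with timestamps and flattens them into one long document with placeholder visual tokens. The field names, the <image> token, and the overall schema are hypothetical; the released Howto-Interlink7M format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    start_sec: float   # clip boundaries within the source video
    end_sec: float
    caption: str       # detailed, clip-level description


@dataclass
class InterleavedVideoDoc:
    video_id: str
    clips: list[ClipAnnotation] = field(default_factory=list)

    def to_interleaved_text(self, image_token: str = "<image>") -> str:
        # Interleave a placeholder visual token with each clip's caption so the
        # whole video reads as one long, coherent multimodal document.
        return " ".join(f"{image_token} {c.caption}" for c in self.clips)


doc = InterleavedVideoDoc(
    video_id="howto_000123",
    clips=[
        ClipAnnotation(0.0, 12.5, "A person gathers flour, eggs, and butter on a counter."),
        ClipAnnotation(12.5, 30.0, "They whisk the eggs into the flour, continuing the recipe."),
    ],
)
print(doc.to_interleaved_text())
```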

Performance and Evaluation

When compared to OpenFlamingo, a similar autoregressive vision-language model, COSMO shows a pronounced improvement even though it uses fewer learnable parameters and fewer training samples. The advantage is particularly noticeable in challenging settings such as 4-shot Flickr captioning, where COSMO raises the score from 57.2 to 65.1.

Conclusion

The integration of contrastive loss into multimodal learning frameworks, along with the development of high-quality, long-text datasets, represents a promising direction for AI research. The advancements made by COSMO and the Howto-Interlink7M dataset not only set new standards for multimodal tasks but also open up extensive opportunities for future research, especially in the realm of long-text data applications. The release of the trained models and datasets is eagerly anticipated, with the potential to catalyze further research in the field.
