M6: A Chinese Multimodal Pretrainer

Published 1 Mar 2021 in cs.CL | (2103.00823v4)

Abstract: In this work, we construct the largest dataset for multimodal pretraining in Chinese, which consists of over 1.9TB images and 292GB texts that cover a wide range of domains. We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer, for unified pretraining on the data of single modality and multiple modalities. We scale the model size up to 10 billion and 100 billion parameters, and build the largest pretrained model in Chinese. We apply the model to a series of downstream applications, and demonstrate its outstanding performance in comparison with strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation, and show that the finetuned M6 can create high-quality images with high resolution and abundant details.


Authors (25)

Citations (127)

Summary

The paper introduces M6-Corpus, the largest Chinese multimodal pretraining dataset, with 1.9TB of images and 292GB of text.
The model uses a unified transformer architecture, scaled to 10B parameters and, with Mixture-of-Experts, to 100B parameters.
M6 outperforms strong baselines on VQA, image captioning, and image-text matching, and is finetuned for text-to-image generation.

This paper presents the development and evaluation of M6, a large-scale Chinese multimodal pretraining model. The authors construct the largest dataset for Chinese multimodal pretraining to date, incorporating over 1.9TB of images and 292GB of text. The dataset, named M6-Corpus, spans multiple domains, including encyclopedic entries, forum discussions, and e-commerce data, and supports pretraining on both single-modality and cross-modal content.

M6 employs a pretraining approach designed to process large amounts of data across modalities with a single, unified model architecture. The model is scaled to 10 billion parameters (M6-10B) and 100 billion parameters (M6-100B). M6-100B uses a Mixture-of-Experts (MoE) architecture, which keeps this scale tractable through sparse activation: only a subset of the experts is active for a given input, so the compute per input grows far more slowly than the total parameter count.
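To make the idea of sparse activation concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration only, not the M6 implementation: the class name ToyMoELayer, the linear experts, the gate, and all sizes are assumptions chosen for readability.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Random weights; stands in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical MoE layer: a gate picks top_k experts per token."""

    def __init__(self, dim, num_experts, top_k=1, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.top_k = top_k
        self.gate = make_matrix(num_experts, dim, rng)
        # Each expert is a single dim x dim linear map (an assumption).
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]

    def forward(self, token):
        probs = softmax(matvec(self.gate, token))
        # Keep only the top_k experts; the others are never evaluated.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        output = [0.0] * self.dim
        for i in chosen:
            weight = probs[i] / norm
            expert_out = matvec(self.experts[i], token)
            output = [o + weight * e for o, e in zip(output, expert_out)]
        return output, chosen

    def expert_params(self):
        """Return (total expert parameters, parameters used per token)."""
        per_expert = self.dim * self.dim
        return per_expert * len(self.experts), per_expert * self.top_k


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, chosen = layer.forward([0.1, -0.2, 0.3, 0.4])
total_params, active_params = layer.expert_params()
```

In this toy setting the layer holds 128 expert parameters but touches only 32 of them for each token, which is the property that lets total parameter count grow without a matching growth in per-token compute.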

The paper presents several key contributions:

Dataset Construction: The M6-Corpus is introduced as a substantial resource for Chinese multimodal research. It establishes a benchmark for future studies by encompassing not only textual data but also detailed image-text pairs.
Model Architecture: The M6 model integrates both encoder and decoder functionalities into a single framework using a transformer-based architecture. This allows the model to perform tasks such as text-to-text and image-to-text generation using shared resources, enhancing training efficiency and model versatility.
Scalability: The authors implement extensive training infrastructure improvements to support large-scale model training. This includes leveraging distributed training techniques and optimizing communications for MoE structures, enabling the scale-up to 100 billion parameters.
Performance: M6 outperforms strong baselines on Visual Question Answering (VQA), image captioning, and image-text matching. Notably, M6 shows an 11.8% improvement in VQA accuracy and a 10.3% improvement in image-text matching against comparable models.
Text-to-Image Generation: An innovative contribution is the application of M6 in text-to-image generation, where the model successfully generates high-quality and detail-rich images from textual descriptions. This capability opens new avenues for creative applications within design and e-commerce.
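One common way to let a single transformer act as both encoder and decoder, as the Model Architecture item describes, is to vary the attention mask: a prefix (for example image patches or source text) is attended to bidirectionally, while the generated text is attended to causally. The sketch below shows such a mask. Whether M6 uses exactly this scheme is an assumption here, and the function name build_attention_mask is hypothetical.

```python
def build_attention_mask(prefix_len, target_len):
    """Return a square boolean mask; mask[q][k] is True if query q may attend to key k.

    Prefix positions (0 .. prefix_len - 1) see the whole prefix (encoder-like).
    Target positions see the whole prefix plus earlier targets (decoder-like).
    """
    total = prefix_len + target_len
    mask = []
    for query in range(total):
        row = []
        for key in range(total):
            if key < prefix_len:
                # Prefix keys are visible to every position.
                allowed = True
            else:
                # Target keys are visible only to the same or later positions,
                # which also hides them from all prefix queries.
                allowed = query >= key
            row.append(allowed)
        mask.append(row)
    return mask


# Two prefix positions (e.g. image patches) followed by two text tokens.
mask = build_attention_mask(prefix_len=2, target_len=2)
```

With prefix_len=0 the mask reduces to a purely causal (decoder-only) mask, and with target_len=0 it is fully bidirectional (encoder-only), so one set of weights can serve both roles.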

The implications of this study are multifaceted. Practically, M6 can be directly applied to various industries such as e-commerce, enhancing product description generation and customer interaction capabilities. Theoretically, the work demonstrates the potential of cross-modal pretraining at scale, setting a precedent for future multimodal AI developments, particularly in non-English contexts.

Future work may involve expanding the dataset further, refining the pretraining tasks to better leverage cross-modal information, and improving training efficiency and model interpretability to expedite the deployment of such large-scale models in more practical settings. The work with M6 is a significant step towards harnessing large-scale data for multimodal AI systems and demonstrates the potential impact of such technologies in diverse applications.
