CCMB: A Large-scale Chinese Cross-modal Benchmark

Published 8 May 2022 in cs.CV and cs.AI | (2205.03860v6)

Abstract: Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks. In contrast to plenty of available benchmarks with English corpus, large-scale pre-training datasets and downstream datasets with Chinese corpus remain largely unexplored. In this work, we build a large-scale high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset Zero and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, plus two of the five fine-tuning datasets are also currently the largest ones for Chinese cross-modal downstream tasks. Along with the CCMB, we also develop a VLP framework named R2D2, applying a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided Distillation and feature-guided Distillation) to further enhance the learning capability. With the Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets from five broad categories of tasks including image-text retrieval, image-text matching, image caption, text-to-image generation, and zero-shot image classification. The datasets, models, and codes are available at https://github.com/yuxie11/R2D2


Authors (14)

Citations (5)


Summary

The paper presents Zero, a pre-training dataset of 250M images and 750M captions filtered by CTR, enhancing image-text reliability.
The paper introduces R2D2, a hybrid framework combining dual-stream and single-stream methods with global contrastive pre-ranking and two-way distillation.
The paper validates its approach across 12 datasets, achieving superior results in image-text retrieval, matching, and captioning tasks.

Research on vision-language pre-training (VLP) has focused predominantly on large-scale English corpora, leaving a gap in resources for Chinese cross-modal pre-training and downstream tasks. This paper addresses the gap by introducing the Chinese Cross-Modal Benchmark (CCMB), which comprises a pre-training dataset and five fine-tuning datasets, together with a pre-training framework named R2D2. These contributions give the research community substantial resources and support the development of high-quality vision-language models for Chinese.

Key Contributions: CCMB and its Pre-Training Dataset Zero

1. Zero: A High-Caliber Pre-Training Dataset.

The cornerstone of CCMB is its pre-training dataset, Zero, which consists of 250 million images and 750 million text descriptions selected by filtering on user click-through rate (CTR). The filter is intended to keep relevant, high-quality pairs, on the premise that a higher CTR indicates a stronger correlation between an image and its text. Zero also provides multiple text descriptions per image, which increases data diversity, an attribute that matters for building robust vision-language models for Chinese.
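The CTR-based selection described above can be illustrated with a minimal sketch. The record fields (`image_id`, `text`, `clicks`, `impressions`) and the threshold `min_ctr` are hypothetical; the summary does not state the schema or the cutoff actually used to build Zero.

```python
from collections import defaultdict


def click_through_rate(clicks, impressions):
    """Return clicks / impressions, or 0.0 when there were no impressions."""
    return clicks / impressions if impressions > 0 else 0.0


def filter_pairs_by_ctr(records, min_ctr):
    """Keep image-text pairs whose CTR is at least min_ctr.

    Returns a dict mapping each image id to the list of texts that
    survived the filter, so one image can keep several descriptions.
    The field names and the threshold are assumptions for illustration.
    """
    kept = defaultdict(list)
    for record in records:
        if click_through_rate(record["clicks"], record["impressions"]) >= min_ctr:
            kept[record["image_id"]].append(record["text"])
    return dict(kept)


# Toy click log: three candidate texts for img1, one unseen pair for img2.
records = [
    {"image_id": "img1", "text": "red sports car", "clicks": 30, "impressions": 100},
    {"image_id": "img1", "text": "sports car on a highway", "clicks": 12, "impressions": 100},
    {"image_id": "img1", "text": "weather today", "clicks": 1, "impressions": 100},
    {"image_id": "img2", "text": "a small cat", "clicks": 0, "impressions": 0},
]
filtered = filter_pairs_by_ctr(records, min_ctr=0.1)
```

In this toy log, `img1` keeps its two well-clicked descriptions, while the low-CTR text and the never-shown `img2` pair are dropped.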

2. Comprehensive Evaluation Suite: Downstream Datasets.

Beyond pre-training data, the paper provides five human-annotated datasets for downstream tasks such as image-text retrieval, image-text matching, and image captioning. With large-scale datasets such as Image-Caption Matching (ICM) and Image-Query Retrieval (IQR), CCMB offers a rich evaluation ground for vision-language models.

The R2D2 Framework: Advanced Vision-Language Representation

The R2D2 framework uses a hybrid architecture that combines dual-stream and single-stream designs, so that the model can both encode images and text separately and model fine-grained interactions between them. The framework introduces several methodological components:

Global Contrastive Pre-Ranking (GCPR): Contrastive learning is applied globally, aligning image and text representations across multiple processors, with queues used for stable representation learning.
Fine-Grained Ranking (FGR): Complementing GCPR, FGR scores individual image-text pairs in detail, refining the coarser alignment learned in pre-ranking.
Two-way Distillation (TwD): A distillation strategy that combines target-guided and feature-guided distillation to improve robustness against noisy labels and strengthen generalization.
Enhanced Training for MLM (ET): Masked language modeling (MLM) is executed concurrently with FGR, reducing computational cost without compromising performance.
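One way to read the pre-ranking + ranking split is as a two-stage retrieval pipeline: a cheap similarity over separately encoded vectors shortlists candidates, and a more expensive pairwise scorer reorders only the shortlist. The sketch below is an illustration under that assumption, not the R2D2 implementation; the embeddings, the `fine_scorer` callable, and the shortlist size `k` are all hypothetical stand-ins.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pre_rank(text_vec, image_vecs, k):
    """Stage 1: shortlist the k images most similar to the text embedding."""
    scored = sorted(
        image_vecs.items(),
        key=lambda item: cosine(text_vec, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in scored[:k]]


def rank(query, candidates, fine_scorer):
    """Stage 2: reorder the shortlist with a pairwise (query, image) scorer."""
    return sorted(candidates, key=lambda image_id: fine_scorer(query, image_id), reverse=True)


def retrieve(query, text_vec, image_vecs, fine_scorer, k):
    """Run pre-ranking, then fine-grained ranking on the shortlist only."""
    return rank(query, pre_rank(text_vec, image_vecs, k), fine_scorer)


# Toy data: made-up 2-d embeddings and a lookup table standing in for
# a learned fine-grained matching model.
image_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
text_vec = [1.0, 0.0]
fine_scores = {"a": 0.2, "b": 0.9, "c": 1.0}


def toy_fine_scorer(query, image_id):
    return fine_scores[image_id]


results = retrieve("a red car", text_vec, image_vecs, toy_fine_scorer, k=2)
```

Here image `c` has the highest fine-grained score but never reaches the second stage, because pre-ranking shortlists only `a` and `b`. This is the usual trade-off of such a design: the expensive scorer runs on `k` candidates instead of the whole collection, at the cost of depending on the shortlist's recall.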

Performance Assessment Across Multiple Domains

The empirical validation on twelve datasets, spanning tasks such as image-text retrieval, matching, and captioning, demonstrates the effectiveness of both CCMB and R2D2. R2D2 achieves leading results across these tasks, indicating that the framework learns detailed semantic associations between images and text. The benchmark results show improvement across the evaluated task categories, advancing the state of the art in Chinese VLP.

Theoretical and Practical Implications

The introduction of a large-scale, diverse dataset for Chinese cross-modal applications gives further research on Chinese vision-language models a shared foundation. The paper's use of CTR-filtered data offers one practical route to higher-quality image-text pairs, while the R2D2 framework provides a reference architecture for handling complex multimodal tasks efficiently.

Prospective Developments

Future work may extend CCMB to additional languages, aligning it with multilingual and multicultural contexts. Model architectures could also integrate richer syntactic and semantic layers to capture nuanced cultural contexts. Further, incorporating strategies such as self-supervised learning and meta-learning could improve adaptability and generalization.

In essence, this paper provides a resource and a methodology that enrich cross-modal learning for Chinese, with implications for interactive AI systems capable of multilingual processing and understanding.
