
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

(2406.08707)
Published Jun 13, 2024 in cs.CL and cs.CV

Abstract

Multimodal LLMs (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like data only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.

Figure: Score difference, averaged over benchmarks and languages, between the mOSCAR+text-image model and the text-image-only model.

Overview

  • The paper introduces mOSCAR, a large-scale, multilingual, and multimodal dataset containing 315 million documents, 214 billion tokens, and 1.2 billion images across 163 languages, designed to overcome the limitations of existing monolingual datasets.

  • Data collection methods include text and image extraction, language identification through open-LID, high-quality content filtering using models like nsfw-detector and NudeNet, and deduplication with MinHashLSH, ensuring a balanced and diverse corpus.

  • Models trained on the dataset showed superior performance in multilingual few-shot learning scenarios, validating the value of document-level interleaved data and mOSCAR's contribution to the linguistic inclusivity of AI models.

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus: A Critical Review

The paper "mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus" by Matthieu Futeral et al. introduces mOSCAR, an extensive multilingual and multimodal dataset. Unlike previous datasets, which primarily consist of monolingual, caption-like data, mOSCAR aims to expand the linguistic and cultural scope of data available to the AI ecosystem.

Introduction to mOSCAR

mOSCAR is presented as the first large-scale dataset of its kind, encompassing 315 million documents, 214 billion tokens, and 1.2 billion images across 163 languages. The primary motivation behind mOSCAR is to overcome the limitations posed by existing datasets, which are often English-only and restricted to caption-like data. By providing a diversified corpus, mOSCAR aims to facilitate research and development in Multimodal LLMs (mLLMs) that cater to a broader linguistic spectrum.

Data Collection and Methodology

The data for mOSCAR is sourced from Common Crawl, processed through a rigorous pipeline involving multiple steps:

  1. Text and Image Extraction: Text and images are extracted from web documents, with initial pruning to discard irrelevant and low-quality content.
  2. Language Identification: The OpenLID classifier assigns each document to one of the 201 languages it supports.
  3. Text and Image Filtering: Heuristics-based and model-based filtering ensure high-quality content. Key steps include eliminating NSFW content using models like nsfw-detector and NudeNet.
  4. Deduplication: MinHashLSH is employed to remove duplicate text and images both within and across documents.
  5. Combining Modalities: Joint text-image filtering is applied to ensure coherence between text and image data within documents.

Through these extensive stages, mOSCAR achieves a balance between data quality and diversity, making it a valuable resource for multilingual research.
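The deduplication step relies on MinHashLSH (available, for example, in the datasketch library). The following is a simplified, self-contained sketch of the underlying MinHash idea, without the LSH bucketing that makes it scale; function names and parameters are illustrative:

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_perm=64):
    """One minimum per seeded hash function over the shingle set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard
    similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate documents produce signatures that agree in most slots; LSH then buckets signatures so that only likely-similar pairs are compared, avoiding the quadratic all-pairs comparison.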

Dataset Evaluation

Quality and Diversity Metrics:

  • Text Content: Quality is assessed via perplexity (computed with Gemma-2B) and diversity via the Vendi Score (over SimCSE embeddings).
  • Image Content: Image diversity is likewise measured with the Vendi Score; comparison against caption-style datasets such as LAION-400M shows mOSCAR's greater text and image diversity.
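The Vendi Score used above is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix; it behaves like an "effective number of distinct items". A minimal numpy sketch, assuming K is an n x n similarity matrix with unit diagonal (e.g. K = E @ E.T for row-normalized SimCSE or image embeddings E):

```python
import numpy as np

def vendi_score(K):
    """Vendi Score: exp of the Shannon entropy of the eigenvalues
    of K / n, where K is an n x n similarity matrix with unit
    diagonal (e.g. cosine similarities of embeddings)."""
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]  # drop numerical zeros / tiny negatives
    return float(np.exp(-np.sum(lam * np.log(lam))))
```

Sanity checks follow directly from the definition: n mutually dissimilar items (K = identity) score n, while n identical items (K = all-ones) score 1.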

Experimental Evaluation

To validate the utility of mOSCAR, the authors trained a multilingual OpenFlamingo model on a subset of mOSCAR combined with captioning data from LAION-400M. The model was evaluated on diverse benchmarks encompassing visual question answering (xGQA, MaXM), captioning (xFlickr&CO, XM3600), reasoning (XVNLI, MaRVL), and multimodal machine translation (Multi30K, CoMMuTE).

Results:

  • The mOSCAR-enhanced model outperformed those trained on captioning data alone, particularly in few-shot learning scenarios.
  • Significant boosts in multilingual few-shot performance were observed across tasks, highlighting the importance of document-level interleaved training data.
  • Evaluation on translate-test benchmarks underscored mOSCAR's advantage in handling translated textual content and performing zero-shot disambiguation.

Implications and Future Directions

Theoretical Implications:

  • The inclusion of a wide range of languages yields more robust models that better serve speakers of underrepresented languages.
  • Document-level interleaved data offers a stronger basis for in-context learning, as demonstrated by superior performance metrics.

Practical Implications:

  • mOSCAR sets a new standard for dataset creation, emphasizing both multilingual and multimodal data.
  • This dataset opens new avenues for enhancing the linguistic inclusivity of AI models, crucial for applications in global contexts.

Future Developments:

  • Moving forward, a critical area for expansion is the inclusion of more low-resource languages through further optimization of web-crawling and data-filtering techniques.
  • Long-term research should also focus on diminishing the inherent biases and toxicity that may arise from web-crawled content, ensuring safer and more equitable AI development.

Conclusion

In summary, mOSCAR represents a pivotal advancement in the realm of multilingual and multimodal datasets. By offering a comprehensive corpus that spans 163 languages and multiple modalities, it addresses significant gaps in the current landscape of mLLM research. The dataset not only enhances few-shot learning performance but also fosters broader inclusivity, making it an invaluable resource for the future of AI development.
