MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

Published 18 Apr 2022 in cs.CL, cs.AI, and cs.LG | (2204.08582v2)

Abstract: We present the MASSIVE dataset--Multilingual Amazon Slu resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.


Authors (16)


Citations (109)


Summary

The paper introduces MASSIVE as a large-scale multilingual corpus covering 51 typologically diverse languages for NLU tasks like slot-filling and intent classification.
The dataset is built by localizing the English SLURP dataset into 50 languages using professional translators, ensuring natural and high-quality linguistic data.
The paper benchmarks pre-trained models such as XLM-R and mT5, revealing performance variations that highlight challenges in zero-shot settings and low-resource languages.

Overview of the MASSIVE Dataset Paper

The paper introduces MASSIVE, a multilingual dataset for Natural Language Understanding (NLU) covering 51 languages. With 1 million examples, it targets slot-filling, intent classification, and virtual assistant evaluation. Its typologically diverse, parallel utterances extend existing multilingual NLU resources and support experiments in cross-lingual and multilingual settings.

Dataset Composition and Collection

The MASSIVE dataset contains parallel, labeled virtual assistant utterances spanning 18 domains, 60 intents, and 55 slots. It was created by localizing the English-only SLURP dataset into 50 additional languages, using professional translators to keep the utterances natural and realistic. The data is split into training, validation, and test sets, plus a held-out evaluation set intended for competitive benchmarking.

The data was collected through a multi-step workflow of translation and localization tasks, followed by quality-assurance phases. This process is intended to secure both linguistic coverage and annotation accuracy, which makes MASSIVE a useful resource for developing and evaluating multilingual NLU models.
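To make the idea of a labeled utterance concrete, the following is a minimal sketch of how one record might be unpacked into plain text and slot values. It assumes a bracketed annotation style of the form `[slot : value]` and field names such as `locale`, `intent`, and `annot_utt`; none of these details are stated in this summary, and the example record is invented for illustration.

```python
import re

# Assumed annotation style (not described in this summary):
# slot spans are written inline as "[slot_name : slot value]".
SLOT_PATTERN = re.compile(r"\[\s*([^\[\]:]+?)\s*:\s*([^\[\]]+?)\s*\]")


def parse_annotated_utterance(annotated):
    """Return (plain_text, slots) for a bracket-annotated utterance.

    slots is a list of (slot_name, slot_value) pairs in order of appearance.
    """
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annotated)]
    # Replace each bracketed span with just its value to recover the raw text.
    plain_text = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return plain_text, slots


# Hypothetical record; the field names and values are illustrative assumptions.
example = {
    "locale": "en-US",
    "intent": "alarm_set",
    "annot_utt": "wake me up at [time : nine am] on [date : friday]",
}

plain_text, slots = parse_annotated_utterance(example["annot_utt"])
print(plain_text)  # wake me up at nine am on friday
print(slots)       # [('time', 'nine am'), ('date', 'friday')]
```

Because the utterances are parallel, the same intent and slot names would be expected to recur across locales for a given example, with only the surface text and slot values localized.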

Linguistic Diversity and Selection Criteria

MASSIVE achieves its linguistic diversity by including languages from 29 genera, 14 families, and 21 scripts, which together cover a wide range of grammatical structures and typological features. Languages were selected according to cost constraints, existing support in major virtual assistants, typological and script diversity, and prevalence in digital communication.

The dataset also makes it possible to study less-explored linguistic phenomena in device-directed speech, such as imperative marking, word order variation, and politeness systems. This diversity makes the dataset useful both for practical multilingual systems and for theoretical linguistic research.

Benchmarking and Modeling Results

The paper reports results for the pre-trained models XLM-R and mT5 on MASSIVE's NLU tasks, measured by exact match accuracy, intent classification accuracy, and slot-filling F1 score. Performance varies across languages, which points to the influence of pre-training data quantity and typological factors. Models perform well on languages with more pre-training data, while zero-shot settings remain markedly harder, which motivates further work on unsupervised learning and data augmentation.
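The three metrics named in the abstract can be illustrated with a small sketch. The definitions below are one common reading (micro-averaged F1 over exact slot name/value pairs, and exact match requiring both intent and all slots to be correct); the paper's evaluation code may define them differently, and the toy gold and predicted labels are invented.

```python
def intent_accuracy(gold, pred):
    """Fraction of utterances whose predicted intent equals the gold intent."""
    correct = sum(g["intent"] == p["intent"] for g, p in zip(gold, pred))
    return correct / len(gold)


def slot_micro_f1(gold, pred):
    """Micro-averaged F1 over (slot_name, slot_value) pairs.

    Assumption: a predicted slot counts only if name and value both match.
    Slots are compared as sets, so duplicate pairs in one utterance collapse.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_slots, p_slots = set(g["slots"]), set(p["slots"])
        tp += len(g_slots & p_slots)
        fp += len(p_slots - g_slots)
        fn += len(g_slots - p_slots)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match_accuracy(gold, pred):
    """Fraction of utterances with the intent and every slot predicted correctly."""
    hits = sum(
        g["intent"] == p["intent"] and set(g["slots"]) == set(p["slots"])
        for g, p in zip(gold, pred)
    )
    return hits / len(gold)


# Toy labels, invented for illustration only (not results from the paper).
gold = [
    {"intent": "alarm_set", "slots": [("time", "nine am"), ("date", "friday")]},
    {"intent": "play_music", "slots": [("artist_name", "adele")]},
    {"intent": "weather_query", "slots": []},
    {"intent": "alarm_set", "slots": [("time", "six am")]},
]
pred = [
    {"intent": "alarm_set", "slots": [("time", "nine am"), ("date", "friday")]},
    {"intent": "play_music", "slots": [("song_name", "adele")]},   # wrong slot name
    {"intent": "calendar_query", "slots": []},                      # wrong intent
    {"intent": "alarm_set", "slots": [("time", "six am")]},
]

print(intent_accuracy(gold, pred))
print(slot_micro_f1(gold, pred))
print(exact_match_accuracy(gold, pred))
```

Exact match is the strictest of the three: a single wrong slot or a wrong intent makes the whole utterance count as an error, so it is never higher than intent accuracy.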

Statistical analyses highlight the correlation between language representation in pre-training and task performance, emphasizing the role of balanced multilingual data in enhancing model robustness. These insights suggest promising directions for future research, including more sophisticated tokenization for non-Latin scripts and enhanced fine-tuning strategies for low-resource languages.
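One way to quantify such a relationship is a rank correlation between per-language pre-training data size and task accuracy. The sketch below uses Spearman's coefficient as an illustrative choice; this summary does not say which statistic the paper uses, and the numbers are purely synthetic placeholders, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic placeholder values for five hypothetical languages.
pretraining_gb = [300.0, 50.0, 5.0, 0.5, 20.0]
accuracy = [0.88, 0.85, 0.79, 0.70, 0.86]

rho = spearman(pretraining_gb, accuracy)
print(round(rho, 3))
```

A rank-based measure is used here because pre-training data sizes typically span several orders of magnitude, which makes a linear correlation on raw sizes hard to interpret.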

Implications and Future Directions

The release of the MASSIVE dataset is expected to support progress in multilingual NLU and in theoretical linguistics. Its scale and language coverage make it a strong basis for building multilingual systems that serve speakers of many different languages. Its use in competitive settings is also likely to drive work on cross-lingual transfer learning and multilingual model architectures.

Looking forward, the dataset can support new approaches in machine translation, linguistic analysis, and the practical deployment of virtual assistants in a broader set of languages. As researchers build on it, MASSIVE can help close gaps in multilingual understanding and make AI technologies more inclusive.
