ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic

Published 27 Dec 2020 in cs.CL | (2101.01785v3)

Abstract: Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large (~3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.


Authors (3)

Citations (418)


Summary

The paper introduces ARBERT and MARBERT, Arabic-specific deep bidirectional transformers that together set new state-of-the-art results on 37 of 48 Arabic classification tasks.
Both models use a 12-layer architecture with 768 hidden units and about 163M parameters, and are pre-trained on diverse Arabic text, including dialectal data from social media.
The ARLUE benchmark, built from 42 datasets across six task clusters, provides a rigorous, standardized evaluation of Arabic language models.

Analyzing Deep Bidirectional Transformers for Arabic Language Understanding

NLP is currently dominated by pre-trained language models (PLMs) such as BERT and RoBERTa, which have markedly improved performance on a wide range of tasks through transfer learning. These models were initially developed with a strong focus on English and have spurred interest in multilingual variants such as mBERT and XLM-RoBERTa. Despite their broad coverage, the multilingual models have limitations, particularly for languages whose data is sparse or diverse, or whose syntax and semantics differ substantially from English. Arabic is one such language: alongside Modern Standard Arabic (MSA), it comprises a wide range of dialects.

In response to these challenges, researchers from the University of British Columbia have introduced two new language models, ARBERT and MARBERT, built specifically for Arabic. Both use deep bidirectional Transformer architectures and target the distinctive linguistic features of Arabic and its dialects. To accompany the models, the authors designed the Arabic Language Understanding Evaluation (ARLUE) benchmark, which standardizes the testing of NLP systems across diverse Arabic dialects through a series of standardized experiments.

Key Features and Contributions

Model Architecture and Data: ARBERT and MARBERT both use the BERT-Base architecture, with 12 layers, 768 hidden units, and 12 attention heads, for a total of around 163 million parameters. Their training data is a large body of Arabic text drawn from a variety of sources to cover the linguistic diversity of both MSA and colloquial Arabic. MARBERT is distinguished by its use of social media data, which captures the nuances of dialectal Arabic.
Evaluation Benchmark: ARLUE comprises 42 datasets and targets six task clusters: sentiment analysis, social meaning, topic classification, dialect identification, named entity recognition, and question answering. It gives Arabic NLP models a common, extensive evaluation framework, which makes comparisons between models more consistent and rigorous.
State-of-the-Art Results: ARBERT and MARBERT collectively achieve new state-of-the-art results on 37 out of 48 classification tasks within ARLUE, showing that they perform well across the range of tasks. Notably, the MARBERT model achieves the highest ARLUE score of 77.40, surpassing all other models, including XLM-R Large, which is roughly 3.4 times larger.
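The architecture figures above can be sanity-checked with a short calculation. The sketch below counts the parameters of a standard BERT encoder with 12 layers and 768 hidden units. It is an illustration, not the authors' code: the vocabulary size (100,000 WordPieces), the 512 positions, the 3,072-unit feed-forward layer, and the 550M figure used for XLM-R Large are assumptions that this summary does not state.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Count the parameters of a standard BERT encoder (embeddings, layers, pooler).

    The number of attention heads does not appear here: the heads split the
    hidden dimension, so they do not change the parameter count.
    """
    layer_norm = 2 * hidden  # scale and bias vectors
    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> intermediate -> hidden, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Assumptions, not taken from the summary above.
ASSUMED_VOCAB_SIZE = 100_000
ASSUMED_XLMR_LARGE_PARAMS = 550_000_000

total = bert_param_count(ASSUMED_VOCAB_SIZE)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")       # 162,841,344 parameters (~163M)
print(f"size ratio: {ASSUMED_XLMR_LARGE_PARAMS / total:.1f}x")  # size ratio: 3.4x
```

Under these assumptions the count comes to about 163 million parameters, and the ratio to XLM-R Large comes to about 3.4, in line with the figures quoted above. Nearly half of the total sits in the token embedding table, so the result is sensitive to the assumed vocabulary size.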

Implications and Future Directions

The development of ARBERT and MARBERT, alongside ARLUE, has implications for both practical applications and theoretical research in NLP. The models show the value of tailoring PLMs to specific languages and dialects, especially those with significant linguistic variation. Their results also suggest that Base-sized models can balance performance with computational efficiency, a growing concern as model sizes continue to increase.

Theoretical implications extend to the continued exploration of PLMs in multilingual contexts, as well as the development of benchmarks that can standardize evaluations across languages. Future research may benefit from the incorporation of additional language and dialect resources, as well as further exploration into energy-efficient training methodologies for PLMs.

In conclusion, ARBERT and MARBERT are a substantial step forward for NLP in Arabic and its dialects. The release of ARLUE further strengthens the infrastructure needed for continued progress in the field. These contributions matter as the NLP community continues to seek models that are not only versatile but also sensitive to the diversity of human languages.
