
MaLA-500: Massive Language Adaptation of Large Language Models

(2401.13303)
Published Jan 24, 2024 in cs.CL

Abstract

LLMs have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages. We release MaLA-500 at https://huggingface.co/MaLA-LM

Overview

  • MaLA-500 improves language coverage for LLMs across 534 languages, focusing on low-resource ones through vocabulary extension and continued pretraining on the Glot500-c corpus.

  • The methodology includes utilizing a high-quality base model, LLaMA 2, with a new multilingual tokenizer and LoRA for efficient, environmentally conscious continued pretraining.

  • Evaluations on SIB-200 highlight MaLA-500's superior 3-shot in-context learning, especially for underrepresented languages, showing a correlation between the extended vocabulary and performance.

  • The paper positions MaLA-500 as a significant advance given its extensive language coverage, while acknowledging data inclusivity challenges and bias propagation concerns.

  • MaLA-500's public model weights foster broader research possibilities and technological inclusion, urging further study and ethical vigilance.

Introduction

The development of LLMs such as LLaMA, Mistral, and ChatGPT has significantly advanced natural language processing, particularly for English and other high-resource languages. Nevertheless, the effectiveness of these models diminishes for low-resource languages due to data scarcity and limited model capacity. "MaLA-500: Massive Language Adaptation of LLMs" aims to bridge this divide, extending language coverage to 534 languages through vocabulary extension and continued pretraining on a substantial new corpus, Glot500-c. Evaluation on SIB-200 demonstrates the improved in-context learning that MaLA-500 affords.

Methodology

The authors' methodology rests on four main components: high-quality data, a solid foundation model, vocabulary extension, and continued pretraining. LLaMA 2, pretrained on 2 trillion tokens, serves as the base model. Vocabulary extension integrates a newly trained multilingual tokenizer with LLaMA 2's existing one, improving encoding efficiency across a wide range of languages: segmentation length drops sharply, a benefit that is especially pronounced for languages written in non-Latin scripts.
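The paper does not spell out an implementation for this step, but a minimal sketch of vocabulary extension, assuming the Hugging Face transformers and sentencepiece libraries, the public LLaMA 2 checkpoint, and a hypothetical multilingual.model SentencePiece model trained on the new corpus, might look like the following (the authors' actual merging procedure may differ, e.g., operating directly on the SentencePiece model rather than through add_tokens):

```python
# Sketch: merge pieces from a newly trained multilingual SentencePiece
# model into LLaMA 2's tokenizer and resize the embedding matrix.
# The file "multilingual.model" is a hypothetical placeholder.
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load the new multilingual tokenizer and collect its pieces.
sp = spm.SentencePieceProcessor(model_file="multilingual.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Add only the pieces that LLaMA 2's vocabulary does not already contain.
existing = set(base_tokenizer.get_vocab())
num_added = base_tokenizer.add_tokens([p for p in new_pieces if p not in existing])

# Give the newly added pieces embedding rows so they can be trained.
model.resize_token_embeddings(len(base_tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(base_tokenizer)}")
```

Resizing the embedding matrix gives the new pieces trainable vectors, which the continued pretraining described next is then responsible for learning.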

Continued pretraining is carried out with LoRA to keep the adaptation parameter-efficient, gradually extending the model's ability to learn from new data across an expansive linguistic landscape. Careful hardware utilization and software choices, including modern distributed training frameworks and redundancy optimizers, keep the training process efficient and environmentally mindful.
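A minimal sketch of such a LoRA setup with the peft library is shown below; the rank, scaling factor, dropout, and target modules are illustrative placeholders rather than the hyperparameters reported for MaLA-500:

```python
# Sketch: attach LoRA adapters for parameter-efficient continued pretraining.
# All hyperparameter values here are illustrative, not the paper's settings.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    # Keep the (extended) embedding and output layers trainable so the
    # newly added vocabulary entries can be learned.
    modules_to_save=["embed_tokens", "lm_head"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapters + saved modules are trainable
# ... continue causal-LM training on the multilingual corpus with a standard
# Trainer or a custom loop (optionally using a distributed framework).
```

Only the adapter weights and the explicitly saved modules receive gradients, which is what keeps continued pretraining at this language scale tractable.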

Evaluation

"Mala-500" has been meticulously evaluated against the backdrop of contemporary LLMs on the SIB-200 dataset. The model's unparalleled 3-shot in-context learning performance is reflected in its substantial lead over peers. The nuanced analysis showcases MaLA-500’s robustness, significantly minimizing languages with poor performance while elevating those with an accuracy surpassing 60%. Further, the model's adaptability shines across varying levels of data resourcefulness, underpinning the utility of vocabulary extension and corroborating its correlation with performance gains. The experiment additionally sheds light on the link between the number of in-context shots and accuracy, delineating how MaLA-500 reaches optimal performance with 6-10 shots.

Related Work and Conclusion

The related literature spans multilingual models such as mBERT, XLM-R, and mGPT. Concurrent efforts like Glot500-m and SERENGETI reflect a growing ambition to accommodate an ever-larger number of languages. MaLA-500 sets itself apart through the breadth of its language coverage, achieved via continued pretraining on an open model architecture.

In conclusion, the paper marks a notable stride for LLMs, extending coverage to an unprecedented range of languages while remaining mindful of computational and environmental costs. The public release of the model weights opens the door to broader research and applications. The study also acknowledges limitations, including the inclusion of high-resource languages in its data and the cap on model parameters, and raises ethical concerns about the potential propagation of biases. MaLA-500 thus lays the groundwork for further research, ethical diligence, and continued technological inclusivity.
