
MaLA-500: Massive Language Adaptation of Large Language Models

(2401.13303)
Published Jan 24, 2024 in cs.CL

Abstract

LLMs have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages. We release MaLA-500 at https://huggingface.co/MaLA-LM

Overview

  • MaLA-500 improves language coverage for LLMs across 534 languages, focusing on low-resource ones through vocabulary extension and continued pretraining on the Glot500-c corpus.

  • The methodology includes utilizing a high-quality base model, LLaMA 2, with a new multilingual tokenizer and LoRA for efficient, environmentally conscious continued pretraining.

  • Evaluations on SIB-200 highlight MaLA-500's superior 3-shot in-context learning, especially for underrepresented languages, showing a correlation between the extended vocabulary and performance.

  • The paper positions MaLA-500 as a significant advance given its extensive language coverage, while acknowledging data inclusivity challenges and bias propagation concerns.

  • MaLA-500's public model weights foster broader research possibilities and technological inclusion, urging further study and ethical vigilance.

Introduction

The development of LLMs such as LLaMA, Mistral, and ChatGPT has significantly advanced natural language processing, particularly for English and other high-resource languages. Nevertheless, the effectiveness of these models diminishes for low-resource languages due to data scarcity and limited model capacity. "MaLA-500: Massive Language Adaptation of LLMs" aims to bridge this divide, extending language coverage to 534 languages through vocabulary extension and continued pretraining on a substantial new corpus, Glot500-c. Evaluation on SIB-200 demonstrates the improved in-context learning that MaLA-500 affords.

Methodology

The authors' methodology rests on four main components: high-quality data, a solid foundation model, vocabulary extension, and continued pretraining. LLaMA 2, pretrained on 2 trillion tokens, serves as the base model. Vocabulary extension integrates a newly trained multilingual tokenizer with LLaMA 2's existing one, improving encoding efficiency across a wide range of languages: segmentation length drops sharply, a benefit that is especially pronounced for languages written in non-Latin scripts.
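The paper does not spell out an implementation for this step, but a minimal sketch of vocabulary extension, assuming the Hugging Face transformers and sentencepiece libraries, the public LLaMA 2 checkpoint, and a hypothetical multilingual.model SentencePiece model trained on the new corpus, might look like the following (the authors' actual merging procedure may differ, e.g., operating directly on the SentencePiece model rather than through add_tokens):

```python
# Sketch: merge pieces from a newly trained multilingual SentencePiece
# model into LLaMA 2's tokenizer and resize the embedding matrix.
# The file "multilingual.model" is a hypothetical placeholder.
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load the new multilingual tokenizer and collect its pieces.
sp = spm.SentencePieceProcessor(model_file="multilingual.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Add only the pieces that LLaMA 2's vocabulary does not already contain.
existing = set(base_tokenizer.get_vocab())
num_added = base_tokenizer.add_tokens([p for p in new_pieces if p not in existing])

# Give the newly added pieces embedding rows so they can be trained.
model.resize_token_embeddings(len(base_tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(base_tokenizer)}")
```

Resizing the embedding matrix gives the new pieces trainable vectors, which the continued pretraining described next is then responsible for learning.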

Continued pretraining is carried out with LoRA to keep the adaptation parameter-efficient, gradually extending the model's ability to learn from new data across an expansive linguistic landscape. Careful hardware utilization and software choices, including modern distributed training frameworks and redundancy optimizers, keep the training process efficient and environmentally mindful.
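A minimal sketch of such a LoRA setup with the peft library is shown below; the rank, scaling factor, dropout, and target modules are illustrative placeholders rather than the hyperparameters reported for MaLA-500:

```python
# Sketch: attach LoRA adapters for parameter-efficient continued pretraining.
# All hyperparameter values here are illustrative, not the paper's settings.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    # Keep the (extended) embedding and output layers trainable so the
    # newly added vocabulary entries can be learned.
    modules_to_save=["embed_tokens", "lm_head"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapters + saved modules are trainable
# ... continue causal-LM training on the multilingual corpus with a standard
# Trainer or a custom loop (optionally using a distributed framework).
```

Only the adapter weights and the explicitly saved modules receive gradients, which is what keeps continued pretraining at this language scale tractable.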

Evaluation

"Mala-500" has been meticulously evaluated against the backdrop of contemporary LLMs on the SIB-200 dataset. The model's unparalleled 3-shot in-context learning performance is reflected in its substantial lead over peers. The nuanced analysis showcases MaLA-500’s robustness, significantly minimizing languages with poor performance while elevating those with an accuracy surpassing 60%. Further, the model's adaptability shines across varying levels of data resourcefulness, underpinning the utility of vocabulary extension and corroborating its correlation with performance gains. The experiment additionally sheds light on the link between the number of in-context shots and accuracy, delineating how MaLA-500 reaches optimal performance with 6-10 shots.

Related Work and Conclusion

The related literature spans multilingual models such as mBERT, XLM-R, and mGPT. Concurrent efforts like Glot500-m and SERENGETI reflect a growing ambition to accommodate an ever-larger number of languages. MaLA-500 sets itself apart through the breadth of its language coverage, achieved via continued pretraining on an open model architecture.

In conclusion, the paper marks a notable stride for LLMs, extending coverage to an unprecedented range of languages while remaining mindful of computational and environmental costs. The public release of the model weights opens the door to broader research and applications. The study also acknowledges limitations, including the inclusion of high-resource languages in its data and the cap on model parameters, and raises ethical concerns about the potential propagation of biases. MaLA-500 thus lays the groundwork for further research, ethical diligence, and continued technological inclusivity.
