Abstract

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with the specialized translation model M2M-100-12B on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.

Figure: LLaMAX2-Alpaca vs. LLaMA2-Alpaca on Flores-200. Multilingual pre-training boosts translation despite incomplete language coverage.

Overview

  • The paper investigates enhancing the translation capabilities of LLMs to support over 100 languages, particularly focusing on the LLaMA series through extensive multilingual continual pre-training.

  • Key techniques include vocabulary expansion, dictionary-based data augmentation, and assembling a massive multilingual dataset to optimize translation performance across both high-resource and low-resource languages.

  • Evaluations using benchmarks like Flores-101 and Flores-200 demonstrate significant improvements in low-resource language translations, achieving parity with specialized models like M2M-100-12B.

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Overview

The paper titled "LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages" presents an extensive study on enhancing the translation capabilities of LLMs, particularly the LLaMA series, to support over 100 languages. While LLMs perform notably well on high-resource language translation, their efficacy in low-resource languages remains hindered by the paucity of multilingual data during pre-training. The authors address this limitation through large-scale multilingual continual pre-training, utilizing 35,000 A100 GPU hours, culminating in the LLaMAX models.

Key Contributions

  1. Multilingual Data Augmentation and Vocabulary Expansion:

    • The paper investigates techniques such as vocabulary expansion and dictionary-based data augmentation for improving translation performance; a minimal augmentation sketch follows this list.
    • The authors find that the original LLaMA vocabulary suffices for extending multilingual capabilities without language-specific tokens, which keeps the model architecture unchanged and avoids added complexity.
  2. Training Data Construction:

    • The authors meticulously assemble a massive multilingual dataset covering 102 languages with both monolingual and parallel data.
    • A novel algorithm constructs the training data for each epoch, balancing the contribution of different languages; a generic balancing sketch (not necessarily the paper's algorithm) also follows this list.
  3. Continual Pre-training on LLaMA:

    • The LLaMAX models are trained over an extended period (60 days) on 24 A100 GPUs.
    • Continual pre-training builds on the existing foundation models while incorporating multilingual data, avoiding catastrophic forgetting and preserving robust performance on general tasks.
  4. Empirical Evaluation:

    • The paper provides comprehensive benchmarking using the Flores-101, Flores-200, and other datasets to evaluate the model across different language tasks.
    • LLaMAX demonstrates an average improvement of over 10 spBLEU points for low-resource-centric translations and achieves parity with specialized models like M2M-100-12B on several tasks.
  5. Broader Implications and Future Directions:

    • The authors establish that enhanced translation capabilities also reinforce the model's performance in other multilingual tasks.
    • They propose future work focusing on refining the pre-training process and expanding coverage to include more low-resource languages effectively.
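
The paper itself does not include pseudo-code for these data-side techniques; the two sketches below are minimal illustrations under stated assumptions, not the authors' implementations. The first shows the general idea behind dictionary-based data augmentation: swapping a fraction of the words in a sentence for their dictionary translations to produce code-switched training text. The dictionary format, replacement ratio, and function names are illustrative.

```python
import random

def dictionary_augment(sentence, bilingual_dict, replace_ratio=0.3, seed=None):
    """Create a code-switched variant of `sentence` by swapping a fraction of
    its words for dictionary translations. Illustrative sketch only:
    `bilingual_dict` maps a lowercased source word to candidate translations."""
    rng = random.Random(seed)
    augmented = []
    for token in sentence.split():
        candidates = bilingual_dict.get(token.lower())
        if candidates and rng.random() < replace_ratio:
            augmented.append(rng.choice(candidates))  # swap in a translation
        else:
            augmented.append(token)                   # keep the original word
    return " ".join(augmented)

# Toy usage with a hypothetical English-Swahili dictionary fragment.
toy_dict = {"water": ["maji"], "house": ["nyumba"], "good": ["nzuri", "njema"]}
print(dictionary_augment("The house has good water", toy_dict, replace_ratio=0.5, seed=0))
```

The second sketch concerns per-epoch data balancing. The paper describes its construction algorithm only at a high level, so the snippet below shows a standard alternative, temperature-based sampling over per-language corpus sizes, purely as a reference point rather than the paper's method.

```python
def language_sampling_probs(corpus_sizes, temperature=0.7):
    """Temperature-scaled sampling probabilities over languages. Lower
    temperatures flatten the distribution, up-weighting low-resource
    languages. A common heuristic, not necessarily LLaMAX's algorithm."""
    scaled = {lang: size ** temperature for lang, size in corpus_sizes.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}

# A low-resource language receives a larger share than its raw count suggests.
print(language_sampling_probs({"en": 1_000_000, "sw": 10_000}, temperature=0.5))
```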

Numerical and Empirical Insights

The numerical results underscore the significant improvement LLaMAX brings to translation tasks. The average gain of more than 10 spBLEU points in low-resource language translations over baseline models is particularly noteworthy. The comparative Flores results reported in the paper place LLaMAX on par with the specialized translation model M2M-100-12B, especially for languages included in the Flores-101 benchmark.
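
For readers who want to reproduce spBLEU-style scoring, the snippet below sketches one way to compute it with the sacrebleu library, assuming a release that ships the Flores SentencePiece tokenizer (tokenize="flores200" in recent versions; older versions expose the Flores-101 model as "spm"). The hypothesis and reference strings are placeholders, not outputs from LLaMAX.

```python
import sacrebleu  # recent releases ship the Flores SentencePiece tokenizers

# Placeholder system outputs and a single reference stream.
hypotheses = ["Mti mkubwa umeanguka barabarani."]
references = [["Mti mkubwa umeanguka kwenye barabara."]]

# spBLEU: BLEU computed on SentencePiece subwords rather than language-specific
# word tokenization, making scores comparable across many languages.
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU = {score.score:.2f}")
```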

Theoretical and Practical Implications

From a theoretical standpoint, the work elucidates the efficacy of using the existing LLM vocabulary for supporting multilingual tasks, challenging the conventional approach of introducing language-specific tokens. Practically, the substantial computational resources and strategic continual pre-training enable the development of a robust, multilingual foundational model, making significant strides in low-resource language support.

Speculation on Future Developments

The continual pre-training approach detailed in this study sets the stage for future exploration in several areas:

  • Scaling the coverage to more languages, particularly those with minimal digital footprints.
  • Enhancing data augmentation techniques, perhaps incorporating more sophisticated multilingual dictionaries or generative methods to further bridge gaps in low-resource language data.
  • Fine-tuning for domain-specific applications to leverage the broader language support in specialized contexts such as legal, medical, or technical documents.

Overall, the paper makes a compelling case for the potential of LLMs to transcend their traditional limitations with concerted efforts in multilingual data augmentation and strategic computational investments. The LLaMAX models not only pave the way for more inclusive digital communication but also serve as a blueprint for future advancements in multilingual NLP.
