Abstract

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with the specialized translation model M2M-100-12B on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.

Figure: LLaMAX2-Alpaca vs. LLaMA2-Alpaca on Flores-200. Multilingual pre-training boosts translation despite incomplete language coverage.

Overview

  • The paper investigates enhancing the translation capabilities of LLMs to support over 100 languages, particularly focusing on the LLaMA series through extensive multilingual continual pre-training.

  • Key techniques include vocabulary expansion, dictionary-based data augmentation, and assembling a massive multilingual dataset to optimize translation performance across both high-resource and low-resource languages.

  • Evaluations using benchmarks like Flores-101 and Flores-200 demonstrate significant improvements in low-resource language translations, achieving parity with specialized models like M2M-100-12B.

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Overview

The paper titled "LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages" presents an extensive study on enhancing the translation capabilities of LLMs, particularly the LLaMA series, to support over 100 languages. While LLMs perform notably well on high-resource language translation, their efficacy in low-resource languages remains hindered by the paucity of multilingual data during pre-training. The authors address this limitation through large-scale multilingual continual pre-training, utilizing 35,000 A100 GPU hours, culminating in the LLaMAX models.

Key Contributions

  1. Multilingual Data Augmentation and Vocabulary Expansion:

    • The paper investigates techniques such as vocabulary expansion and dictionary-based data augmentation for improving translation performance; a minimal augmentation sketch follows this list.
    • The authors find that the original LLaMA vocabulary suffices for extending multilingual capabilities without language-specific tokens, which keeps the model architecture unchanged and avoids added complexity.
  2. Training Data Construction:

    • The authors meticulously assemble a massive multilingual dataset covering 102 languages with both monolingual and parallel data.
    • A novel algorithm constructs the training data for each epoch, balancing the contribution of different languages; a generic balancing sketch (not necessarily the paper's algorithm) also follows this list.
  3. Continual Pre-training on LLaMA:

    • The LLaMAX models are trained over an extended period (60 days) on 24 A100 GPUs.
    • Continual pre-training builds on the existing foundation models while incorporating multilingual data, avoiding catastrophic forgetting and preserving robust performance on general tasks.
  4. Empirical Evaluation:

    • The paper provides comprehensive benchmarking using the Flores-101, Flores-200, and other datasets to evaluate the model across different language tasks.
    • LLaMAX demonstrates an average improvement of over 10 spBLEU points for low-resource-centric translations and achieves parity with specialized models like M2M-100-12B on several tasks.
  5. Broader Implications and Future Directions:

    • The authors establish that enhanced translation capabilities also reinforce the model's performance in other multilingual tasks.
    • They propose future work focusing on refining the pre-training process and expanding coverage to include more low-resource languages effectively.
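
The paper itself does not include pseudo-code for these data-side techniques; the two sketches below are minimal illustrations under stated assumptions, not the authors' implementations. The first shows the general idea behind dictionary-based data augmentation: swapping a fraction of the words in a sentence for their dictionary translations to produce code-switched training text. The dictionary format, replacement ratio, and function names are illustrative.

```python
import random

def dictionary_augment(sentence, bilingual_dict, replace_ratio=0.3, seed=None):
    """Create a code-switched variant of `sentence` by swapping a fraction of
    its words for dictionary translations. Illustrative sketch only:
    `bilingual_dict` maps a lowercased source word to candidate translations."""
    rng = random.Random(seed)
    augmented = []
    for token in sentence.split():
        candidates = bilingual_dict.get(token.lower())
        if candidates and rng.random() < replace_ratio:
            augmented.append(rng.choice(candidates))  # swap in a translation
        else:
            augmented.append(token)                   # keep the original word
    return " ".join(augmented)

# Toy usage with a hypothetical English-Swahili dictionary fragment.
toy_dict = {"water": ["maji"], "house": ["nyumba"], "good": ["nzuri", "njema"]}
print(dictionary_augment("The house has good water", toy_dict, replace_ratio=0.5, seed=0))
```

The second sketch concerns per-epoch data balancing. The paper describes its construction algorithm only at a high level, so the snippet below shows a standard alternative, temperature-based sampling over per-language corpus sizes, purely as a reference point rather than the paper's method.

```python
def language_sampling_probs(corpus_sizes, temperature=0.7):
    """Temperature-scaled sampling probabilities over languages. Lower
    temperatures flatten the distribution, up-weighting low-resource
    languages. A common heuristic, not necessarily LLaMAX's algorithm."""
    scaled = {lang: size ** temperature for lang, size in corpus_sizes.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}

# A low-resource language receives a larger share than its raw count suggests.
print(language_sampling_probs({"en": 1_000_000, "sw": 10_000}, temperature=0.5))
```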

Numerical and Empirical Insights

The numerical results underscore the significant improvement LLaMAX brings to translation tasks. The average gain of more than 10 spBLEU points in low-resource language translations over baseline models is particularly noteworthy. The comparative Flores results reported in the paper place LLaMAX on par with the specialized translation model M2M-100-12B, especially for languages included in the Flores-101 benchmark.
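
For readers who want to reproduce spBLEU-style scoring, the snippet below sketches one way to compute it with the sacrebleu library, assuming a release that ships the Flores SentencePiece tokenizer (tokenize="flores200" in recent versions; older versions expose the Flores-101 model as "spm"). The hypothesis and reference strings are placeholders, not outputs from LLaMAX.

```python
import sacrebleu  # recent releases ship the Flores SentencePiece tokenizers

# Placeholder system outputs and a single reference stream.
hypotheses = ["Mti mkubwa umeanguka barabarani."]
references = [["Mti mkubwa umeanguka kwenye barabara."]]

# spBLEU: BLEU computed on SentencePiece subwords rather than language-specific
# word tokenization, making scores comparable across many languages.
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU = {score.score:.2f}")
```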

Theoretical and Practical Implications

From a theoretical standpoint, the work elucidates the efficacy of using the existing LLM vocabulary for supporting multilingual tasks, challenging the conventional approach of introducing language-specific tokens. Practically, the substantial computational resources and strategic continual pre-training enable the development of a robust, multilingual foundational model, making significant strides in low-resource language support.

Speculation on Future Developments

The continual pre-training approach detailed in this study sets the stage for future exploration in several areas:

  • Scaling the coverage to more languages, particularly those with minimal digital footprints.
  • Enhancing data augmentation techniques, perhaps incorporating more sophisticated multilingual dictionaries or generative methods to further bridge gaps in low-resource language data.
  • Fine-tuning for domain-specific applications to leverage the broader language support in specialized contexts such as legal, medical, or technical documents.

Overall, the paper makes a compelling case for the potential of LLMs to transcend their traditional limitations with concerted efforts in multilingual data augmentation and strategic computational investments. The LLaMAX models not only pave the way for more inclusive digital communication but also serve as a blueprint for future advancements in multilingual NLP.
