LLaMA Beyond English: An Empirical Study on Language Capability Transfer

(2401.01055)
Published Jan 2, 2024 in cs.CL and cs.AI

Abstract

In recent times, substantial advancements have been witnessed in LLMs, exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpora, which limits their performance in other non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across the thirteen low-resource languages also exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.

An exploration of how pretrained LLaMA models can be extended to non-English languages, focusing on efficient capability transfer.

Overview

  • The study investigates transferring LLaMA model capabilities to non-English languages with low computational costs, analyzing factors like vocabulary extension, pretraining, and tuning.

  • Vocabulary extension does not show clear benefits; models with less training outperform those with extended vocabularies and more data in some cases.

  • While further pretraining doesn't significantly improve language generation, instruction tuning has a more pronounced impact, especially in preserving fluency and coherence.

  • Training exclusively in Chinese reduces English proficiency, but multilingual joint training can maintain English skills while learning new languages.

  • The study extends to 13 low-resource languages, confirming that instruction tuning enables effective language capability transfer across different linguistic contexts.

Introduction

Advances in LLMs have led to breakthroughs in tasks like reasoning, learning from experience, and following instructions. Yet, despite these advances, the overwhelming focus on English corpora has limited LLMs' abilities in other languages.

This study explores methods for transferring the capabilities of LLMs, specifically LLaMA, to non-English languages with minimal cost. Drawing on over 1440 GPU hours of experiments, the research evaluates vocabulary extension, additional pretraining, and instruction tuning as the key factors influencing the transfer process. Testing on both knowledge benchmarks and instruction-following tasks provides a holistic assessment of the model's language capabilities.
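
The paper's exact evaluation harness is not reproduced here, but knowledge benchmarks of the C-Eval/MMLU/AGI-Eval/GAOKAO-Bench kind are typically scored by comparing the log-likelihood a causal LM assigns to each candidate answer. The sketch below illustrates that common protocol with the HuggingFace transformers library; the checkpoint path and the example question are placeholders, not taken from the paper.

```python
# Minimal sketch of multiple-choice scoring with a causal LM, in the spirit of
# C-Eval/MMLU-style evaluation. It is NOT the paper's exact harness; the
# checkpoint path and example question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens,
    conditioned on the question. Assumes the question's tokenization is
    unchanged when the option is appended (good enough for a sketch)."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    option_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    option_tokens = full_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(option_positions, option_tokens))

question = "中国的首都是哪座城市？答案："
options = ["北京", "上海", "广州", "深圳"]
prediction = max(options, key=lambda opt: option_logprob(question, opt))
```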

Analyzing Transfer Factors

The study reveals unexpected findings regarding vocabulary extension. Despite theories suggesting its usefulness, extending the vocabulary shows no clear advantage for transferring language capabilities. Surprisingly, vocabulary-extended models further pretrained on 30 billion tokens perform worse than original-vocabulary LLaMA models further pretrained on just 0.5 billion tokens.
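
For concreteness, vocabulary extension in this setting usually means learning new Chinese subword pieces, adding them to the base tokenizer, and enlarging the model's embedding table so the new tokens get trainable rows. The sketch below shows one way to do this with the sentencepiece and transformers libraries; the corpus and checkpoint paths and the vocabulary size are placeholders, and this is not the paper's exact recipe.

```python
# Illustrative vocabulary-extension sketch (not the paper's exact recipe).
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Learn Chinese subword pieces from a raw-text corpus (path is a placeholder).
spm.SentencePieceTrainer.train(
    input="chinese_corpus.txt", model_prefix="zh_sp", vocab_size=20000
)
zh_sp = spm.SentencePieceProcessor(model_file="zh_sp.model")
new_pieces = [zh_sp.id_to_piece(i) for i in range(zh_sp.get_piece_size())]

# 2. Add the new pieces to the original LLaMA tokenizer; pieces it already has are skipped.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama")  # placeholder
num_added = tokenizer.add_tokens(new_pieces)

# 3. Grow the embedding (and tied output) matrix; the new rows are randomly
#    initialized and only become useful after further pretraining.
model = AutoModelForCausalLM.from_pretrained("path/to/llama")
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```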

In terms of training scales, the results indicate that for improving language generation capabilities like fluency and logical coherence, a substantial volume of further pretraining isn't as crucial as a significant amount of instruction tuning. However, in terms of model knowledge like factual accuracy, neither additional pretraining on Chinese nor expanding the vocabulary greatly impacts the LLaMA models' performance.
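
Since instruction tuning carries most of the weight here, it is worth spelling out what that objective usually looks like: the prompt and response are concatenated, and the loss is computed only on the response by masking prompt positions with -100. The sketch below shows that standard recipe; it is not the paper's released training code, and the helper name and maximum length are illustrative.

```python
# Standard supervised instruction-tuning example construction (illustrative).
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by the LM loss

def build_sft_example(tokenizer, prompt: str, response: str, max_len: int = 2048):
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response, add_special_tokens=False).input_ids
    input_ids = ([tokenizer.bos_token_id] + prompt_ids + response_ids
                 + [tokenizer.eos_token_id])[:max_len]
    # Mask the prompt so gradients only come from the response tokens.
    labels = ([IGNORE_INDEX] * (1 + len(prompt_ids)) + response_ids
              + [tokenizer.eos_token_id])[:max_len]
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
    }
```

Batches of such examples can then be fed to an ordinary causal-LM training loop, e.g. the HuggingFace Trainer, which shifts the labels internally.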

Maintaining English Proficiency

Another aspect considered is the impact of focused language-transfer training on an LLM's original English capabilities. Models trained exclusively on Chinese data show a reduction in English proficiency, which suggests a trade-off between learning a new language and maintaining existing capabilities. The solution appears to lie in multilingual joint training, which helps preserve English skills while extending to new languages.
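
A minimal sketch of the data side of such joint training, assuming the instruction examples are held in plain Python lists; the English mixing fraction is an illustrative assumption, not a ratio reported in the paper.

```python
# Illustrative data mixing for multilingual joint training (ratio is assumed).
import random

def mix_datasets(zh_examples, en_examples, en_fraction=0.25, seed=0):
    """Keep all Chinese examples and blend in a fraction of English examples,
    sized relative to the Chinese set, so English ability is rehearsed while
    Chinese is learned."""
    rng = random.Random(seed)
    n_en = min(int(len(zh_examples) * en_fraction), len(en_examples))
    mixed = list(zh_examples) + rng.sample(list(en_examples), n_en)
    rng.shuffle(mixed)
    return mixed
```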

Expanding to Multiple Languages

This research also extends its findings beyond Chinese, encompassing 13 low-resource languages to validate the transfer process's effectiveness across diverse linguistic landscapes. The results are consistent, showcasing that the LLaMA model can quickly adapt to new languages with suitable instruction tuning, regardless of the resource level of the target language.

Conclusion and Implications

Overall, the study concludes that effective language capability transfer to non-English languages can be achieved with significantly less data than previously thought necessary. The research also underlines the internalized cross-lingual alignment in LLMs, observed through code-switching instances in the model's responses, which may play a role in the transferability of language capabilities. These insights have the potential to guide the development of more capable and efficient multilingual LLMs, lowering the barriers for languages with fewer resources.
