
Exploring Design Choices for Building Language-Specific LLMs

(2406.14670)
Published Jun 20, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Despite rapid progress in LLMs, their performance on the vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued fine-tuning) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance before adaptation is not always indicative of the final performance, (2) efficiency can be easily improved with simple vocabulary extension and continued fine-tuning in most LLMs we study, and (3) the optimal adaptation method is highly language-dependent, and the simplest approach works well across various experimental settings. Adapting English-centric models can yield better results than adapting multilingual models despite their worse initial performance on low-resource languages. Together, our work lays the foundation for efficiently building language-specific LLMs by adapting existing LLMs.

Figure: Process of adapting a base language model to a specific language through token augmentation and pre-training.

Overview

  • The paper investigates the adaptation of existing monolingual and multilingual LLMs to build language-specific models.

  • It finds that simple interventions like vocabulary extension and continued fine-tuning can significantly enhance the efficiency and performance of LLMs for low-resource languages.

  • Experimental results show that, for certain tasks, adapted English-centric models can outperform some multilingual models, with the optimal adaptation strategy varying by target language.

Exploring Design Choices for Building Language-Specific LLMs

Overview

This paper explores the challenge of building language-specific LLMs by adapting existing monolingual and multilingual LLMs. While LLMs have advanced substantially, their efficacy across a wide range of languages remains notably lacking. This study systematically investigates how various design choices (base model selection, vocabulary extension, and continued fine-tuning) affect both the efficiency and the end-task performance of adapted LLMs.

Key Findings

  1. Non-Indicative Initial Performance: The performance of a base LLM before adaptation does not necessarily predict its performance post-adaptation. For instance, monolingual English-centric models, despite initially poor performance on low-resource languages, can be adapted effectively through appropriate training.
  2. Efficiency Gains: Simple vocabulary extension and continued fine-tuning can enhance the efficiency of most LLMs, where efficiency is measured by the number of tokens needed to encode equivalent information. For example, a vocabulary extension of around 10K tokens can bridge the efficiency gap between English and low-resource languages.
  3. Language-Specific vs. Multilingual Models: The optimal adaptation strategy is highly dependent on the target language. Contrary to expectations, adapting English-centric models can yield superior results compared to adapting certain multilingual models.

The study contributes foundational insights into the efficient adaptation of existing LLMs to build effective language-specific models.

Methodology

Token Vocabulary Augmentation

  1. Token Generation: The authors trained BPE SentencePiece tokenizers on 300K examples from the mC4 corpus in the target language, generating language-specific vocabularies of varying sizes (1K to 50K tokens).
  2. Vocabulary Merging: The new tokens from the target language were appended to the base vocabulary of the original model, ensuring that all tokens from the initial vocabulary were retained (a sketch of both steps follows below).
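
A minimal sketch of these two steps in Python, assuming the sentencepiece and Hugging Face transformers libraries; the input file, base model name, and 10K vocabulary size are illustrative placeholders rather than the authors' exact configuration:

```python
import sentencepiece as spm
from transformers import AutoTokenizer

# Step 1: train a BPE SentencePiece tokenizer on target-language text
# (e.g., sentences sampled from mC4); path and size are placeholders.
spm.SentencePieceTrainer.train(
    input="mc4_target_language_sample.txt",   # one sentence per line
    model_prefix="target_bpe",
    vocab_size=10_000,
    model_type="bpe",
    character_coverage=1.0,
)

# Step 2: append the new language-specific pieces to the base vocabulary,
# keeping every token the original model already has.
sp = spm.SentencePieceProcessor(model_file="target_bpe.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# add_tokens skips pieces already present in the base vocabulary.
# NOTE: this is a simplified merge; in practice the new pieces (and their BPE
# merge rules) are usually folded directly into the base tokenizer's model file.
num_added = base_tokenizer.add_tokens(new_pieces)
print(f"Added {num_added} tokens; vocabulary size is now {len(base_tokenizer)}")
```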

Continued Pre-Training (CPT)

  1. Embedding Initialization: Each new token's embedding was initialized as the mean of its constituent tokens' embeddings (sketched below). This simple initialization performed competitively with more complex strategies.
  2. Training Details: The models underwent continued training on a large corpus of the target language, with parameters optimized to maximize multilingual efficacy.
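
A minimal sketch of this mean initialization, again assuming Hugging Face transformers and the extended tokenizer produced above; the model name and tokenizer path are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-2-7b-hf"          # illustrative base model
old_tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)

# Tokenizer with the new language-specific tokens appended
# (see the vocabulary-merging sketch above); path is a placeholder.
extended_tokenizer = AutoTokenizer.from_pretrained("path/to/extended_tokenizer")

old_vocab_size = len(old_tokenizer)
model.resize_token_embeddings(len(extended_tokenizer))
embeddings = model.get_input_embeddings().weight

with torch.no_grad():
    for token_id in range(old_vocab_size, len(extended_tokenizer)):
        token = extended_tokenizer.convert_ids_to_tokens(token_id)
        # Decompose the new token into pieces the original tokenizer knows,
        # then set its embedding to the mean of those pieces' embeddings.
        pieces = old_tokenizer.encode(token.replace("▁", " "),
                                      add_special_tokens=False)
        if pieces:
            embeddings[token_id] = embeddings[pieces].mean(dim=0)
```

If the output (LM head) embeddings are untied from the input embeddings, they would typically be initialized the same way; the sketch above only handles the input side.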

Experimental Evaluation

The authors conducted extensive experiments to evaluate the impact of adaptation strategies on four diverse languages—Hindi, Turkish, Arabic, and Tamil—using several benchmarks:

  1. Machine Translation: Assessed with the FLORES-200 benchmark via 5-shot prompting (an illustrative prompt-construction sketch follows this list).
  2. Text Summarization: Evaluated on the XL-Sum dataset by re-framing the task.
  3. Understanding Tasks: Evaluated with benchmarks including mLAMA for knowledge probing, XNLI for natural language inference, sentiment analysis, and commonsense reasoning tasks such as XCOPA and XStoryCloze.
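
The paper's exact prompt template is not reproduced here; the sketch below shows one common way to assemble a 5-shot translation prompt from FLORES-style parallel sentence pairs. The template, language pair, and example pairs are illustrative assumptions:

```python
def build_few_shot_mt_prompt(demonstrations, source_sentence,
                             src_lang="English", tgt_lang="Hindi"):
    """Builds a k-shot translation prompt from (source, target) pairs.

    `demonstrations` would be, e.g., five sentence pairs drawn from the
    FLORES-200 dev split; the template is an illustrative assumption,
    not the paper's exact format.
    """
    lines = []
    for src, tgt in demonstrations:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")                      # blank line between examples
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")              # the model completes the translation
    return "\n".join(lines)

demos = [
    ("Hello, how are you?", "नमस्ते, आप कैसे हैं?"),
    # ... four more (source, target) pairs for a 5-shot prompt
]
prompt = build_few_shot_mt_prompt(demos, "The weather is nice today.")
```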

Results

  1. Efficiency: Vocabulary augmentation significantly reduced the token disparity between English and other languages. For instance, extending LLaMA-2 with 10K tokens substantially reduced tokenizer fertility, shortening the sequences needed to encode the same amount of information (see the measurement sketch after this list).
  2. End Task Performance: Adapted LLaMA-2-7B matched the performance of larger multilingual models on several tasks. Moreover, smaller multilingual models, such as Gemma-2B, demonstrated competitive performance post-adaptation.
  3. Optimal Design Choices: The study finds that the balance between the size of the vocabulary extension and the amount of fine-tuning data is crucial. For instance, smaller vocabulary extensions yielded better performance with less continued training data, while larger vocabulary sizes required more data to achieve substantial performance gains.
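
To make the efficiency metric concrete, here is a small sketch of how tokenizer fertility (average tokens per whitespace-separated word, a common simplification) can be compared before and after vocabulary extension; the model name, tokenizer path, and sample sentence are placeholders:

```python
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.encode(s, add_special_tokens=False))
                   for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")        # placeholder
extended = AutoTokenizer.from_pretrained("path/to/extended_tokenizer")  # placeholder

sample = ["यह एक उदाहरण वाक्य है।"]  # e.g. Hindi sentences sampled from mC4
print("base fertility:    ", fertility(base, sample))
print("extended fertility:", fertility(extended, sample))
```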

Implications and Future Directions

This research highlights the need for pragmatic adaptation strategies tailored to linguistic diversity. By demonstrating how simple adaptations can significantly enhance model efficiency and performance, the study sets the stage for more inclusive, multilingual AI technologies. Future research could expand in several directions:

  1. Broader Language Coverage: Extending the study to more languages, especially under-resourced ones, would help generalize these findings.
  2. Refining Tokenization Approaches: Exploring advanced tokenization methods beyond BPE could yield further efficiency gains.
  3. Cross-Lingual Knowledge Transfer: Investigating the mechanisms of knowledge transfer in multilingual settings could enhance the effectiveness of CPT.

In conclusion, this study provides pivotal insights into the adaptation of LLMs for specific languages, laying the groundwork for creating more efficient and performant language-specific LLMs.
