
Efficiently Adapting Pretrained Language Models To New Languages (2311.05741v2)

Published 9 Nov 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Recent large language models (LLMs) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross-lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on adapting an English LLM to Hungarian and Thai show that our recipe can reach better performance than open source models on the target language, with minimal regressions on English.

Citations (11)

Summary

  • The paper introduces a method that replaces the lowest-frequency tokens in the base tokenizer with target-language tokens, reducing tokenizer fertility by up to 73%.
  • It employs strategic data mixing during continued pretraining and instruction tuning to mitigate catastrophic forgetting and preserve English performance.
  • Experimental results demonstrate significant performance improvements in Hungarian and Thai across multiple language tasks.

Efficiently Adapting Pretrained LLMs to New Languages

Introduction

The paper "Efficiently Adapting Pretrained LLMs to New Languages" (2311.05741) addresses the issue of adapting existing pretrained LLMs for low-resource languages. Given that LLMs are typically trained predominantly on English data, they often exhibit sub-optimal performance on languages with less available data. The paper proposes methods to efficiently adapt LLMs to new languages without the pitfalls of catastrophic forgetting and poor tokenizer efficiency, focusing primarily on Hungarian and Thai.

Problem Statement

LLMs can perform well on multilingual tasks because they transfer knowledge across languages. Their efficacy nevertheless drops significantly on low-resource languages, and training dedicated models from scratch is impractical given the scarcity of high-quality training data and the computational resources required. The paper therefore seeks to improve tokenizer efficiency and prevent catastrophic forgetting while adapting existing English-centric models to new languages. Most LLMs rely on Byte Pair Encoding (BPE) tokenizers; when a tokenizer has not been trained on the target language, it fragments target-language text into many more tokens, inflating sequence lengths and therefore training and inference costs.
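
The effect is easy to observe with an English-centric tokenizer. The snippet below is a small illustration, not from the paper; it assumes the Hugging Face transformers package is installed and can download the GPT-2 tokenizer, whose byte-level BPE was trained almost entirely on English text.

```python
# Illustration: an English-centric BPE tokenizer fragments Thai text into
# many byte-level pieces, while an English word of similar meaning stays short.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # English-centric byte-level BPE

english = "Hello"
thai = "สวัสดี"  # "hello" in Thai

print(len(tok.tokenize(english)))  # a single token
print(len(tok.tokenize(thai)))     # many byte-level tokens for one word
```

Higher token counts per word, i.e. higher fertility, translate directly into longer sequences and more compute for the same amount of text.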

Methodology

Tokenizer Efficiency

The authors propose enhancing tokenizer efficiency by incorporating tokens specific to the target language into the existing vocabulary. This is done by replacing the least frequent tokens in the base model's vocabulary with new tokens from the target language rather than extending the vocabulary size, thereby preserving model capacity. This approach effectively reduces the average number of tokens per word—known as fertility—by substantial margins in both Hungarian and Thai, leading to reduced computational costs in training and inference.
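
A minimal Python sketch of the vocabulary-swap step is shown below. It assumes token frequencies for the base vocabulary and a list of new target-language tokens (for example, from a BPE run on target-language text) are already available; handling of the corresponding embedding rows is not shown, and all names are illustrative rather than taken from the paper.

```python
def swap_rare_tokens(base_vocab_counts, new_tokens, protected=frozenset()):
    """Replace the least frequent base-vocabulary tokens with new target-language
    tokens while keeping the total vocabulary size fixed.

    base_vocab_counts: dict mapping each base token to its frequency on the
    base model's training data.
    new_tokens: tokens learned on target-language text (e.g. from a BPE run).
    protected: tokens that must never be evicted (special tokens, byte fallbacks).
    """
    removable = {t: c for t, c in base_vocab_counts.items() if t not in protected}
    # Sort candidates by frequency so the rarest tokens are evicted first.
    evicted = sorted(removable, key=removable.get)[: len(new_tokens)]
    vocab = (set(base_vocab_counts) - set(evicted)) | set(new_tokens)
    # The vocabulary is not extended, so embedding capacity stays the same.
    assert len(vocab) <= len(base_vocab_counts)
    return vocab


def fertility(tokenize, text):
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    return sum(len(tokenize(w)) for w in words) / max(len(words), 1)
```

Reusing existing embedding slots rather than extending the embedding matrix keeps the model's parameter count unchanged; the embeddings of the swapped-in tokens are then learned during the continued pretraining stage described next.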

Mitigating Catastrophic Forgetting

To prevent catastrophic forgetting, the authors mix data from the original and target languages during both continued pretraining and instruction tuning. Including a measured share of English data alongside the new-language data improves performance in the target language while retaining the model's existing English capabilities.
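
A sketch of such a mixing scheme is given below. The sampling ratio is a placeholder, not the recipe from the paper, and the function name is illustrative; the point is simply that the training stream keeps drawing some examples from the original language.

```python
import random

def mixed_stream(english_pool, target_pool, p_target=0.5, seed=0):
    """Yield training examples from two pools with a fixed mixing ratio.

    p_target is the probability of drawing a target-language example;
    the default is a placeholder, not a value from the paper."""
    rng = random.Random(seed)
    eng, tgt = list(english_pool), list(target_pool)
    while True:
        pool = tgt if rng.random() < p_target else eng
        yield rng.choice(pool)  # sample with replacement from the chosen pool

# Example: 70% Hungarian, 30% English during continued pretraining.
# stream = mixed_stream(english_docs, hungarian_docs, p_target=0.7)
```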

Experimental Results

Adapting the English model to Hungarian and Thai yields better results than current open-source models for these languages, particularly on multiple-choice and open-ended question answering, summarization, and translation tasks. Replacing approximately 10% of the tokenizer's vocabulary reduces fertility by 42% for Hungarian and 73% for Thai while preserving English performance. Furthermore, mixing training data from both languages maintains English proficiency while improving results in the target languages.
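
A rough back-of-envelope calculation, not a measurement from the paper, shows why these fertility gains matter for compute: fewer tokens per word means shorter sequences for the same text, and the quadratic component of self-attention cost shrinks faster still.

```python
# Back-of-envelope illustration derived from the reported fertility reductions;
# the quadratic-attention approximation is an assumption, not a paper result.
for lang, reduction in [("Hungarian", 0.42), ("Thai", 0.73)]:
    rel_tokens = 1.0 - reduction   # relative sequence length for the same text
    rel_attention = rel_tokens**2  # self-attention cost scales ~quadratically
    print(f"{lang}: {rel_tokens:.2f}x tokens, ~{rel_attention:.2f}x attention cost")
```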

Implications and Future Directions

The proposed approach has practical implications for improving LLM performance in underrepresented languages by starting from existing pretrained models. It saves computational resources relative to training from scratch and provides a template that can be applied to other low-resource languages. Future research could scale the approach to additional languages and examine its impact on a broader range of language tasks.

Conclusion

In summary, the paper provides an effective framework for adapting pretrained LLMs to new languages, focusing on optimizing tokenizer efficiency and preventing catastrophic forgetting through strategic data mixing. The results indicate that these methods can significantly enhance performance in low-resource languages without detrimental effects on the original language capabilities, marking a step towards more inclusive multilingual LLMs.
