- The paper introduces a method that replaces the least-frequent tokens in the tokenizer's vocabulary with target-language tokens, reducing fertility by up to 73%.
- It employs strategic data mixing during pretraining to mitigate catastrophic forgetting while preserving English performance.
- Experimental results demonstrate significant performance improvements in Hungarian and Thai across multiple language tasks.
Efficiently Adapting Pretrained LLMs to New Languages
Introduction
The paper "Efficiently Adapting Pretrained LLMs to New Languages" (2311.05741) addresses the issue of adapting existing pretrained LLMs for low-resource languages. Given that LLMs are typically trained predominantly on English data, they often exhibit sub-optimal performance on languages with less available data. The paper proposes methods to efficiently adapt LLMs to new languages without the pitfalls of catastrophic forgetting and poor tokenizer efficiency, focusing primarily on Hungarian and Thai.
Problem Statement
LLMs can transfer knowledge across languages, which lets them handle many multilingual tasks reasonably well. Their efficacy drops sharply on low-resource languages, however, because high-quality training data is scarce and training models from scratch is computationally expensive. The paper therefore seeks to improve tokenizer efficiency and prevent catastrophic forgetting while adapting existing English-centric models to new languages. Byte Pair Encoding (BPE) tokenizers trained mostly on English segment other languages into many more tokens per word, inflating sequence lengths and thus training and inference costs.
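A quick way to see the problem is to compare how many tokens an English-centric BPE tokenizer spends per word in English versus a target language. The sketch below uses the Hugging Face transformers library with GPT-2's tokenizer as a stand-in for the base vocabulary; the base model and measurement protocol in the paper may differ. This tokens-per-word ratio is the fertility discussed below.

```python
# Minimal sketch: measure how an English-centric BPE tokenizer over-segments
# Hungarian text. GPT-2's tokenizer stands in for the base model's tokenizer
# (an assumption; the paper starts from a different pretrained model).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokens_per_word(text: str) -> float:
    """Average number of BPE tokens per whitespace-separated word."""
    words = text.split()
    return len(tokenizer.tokenize(text)) / len(words)

english = "The weather is beautiful today and we are going for a walk."
hungarian = "Az idő ma gyönyörű, és sétálni megyünk a parkba."

print(f"English tokens/word:   {tokens_per_word(english):.2f}")
print(f"Hungarian tokens/word: {tokens_per_word(hungarian):.2f}")
# Thai has no whitespace word boundaries, so its fertility is usually computed
# with a language-specific word segmenter rather than str.split().
```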
Methodology
Tokenizer Efficiency
The authors improve tokenizer efficiency by adding target-language tokens to the existing vocabulary. Rather than extending the vocabulary, they replace the least frequent tokens in the base model's vocabulary with new target-language tokens, which keeps the vocabulary size and embedding matrix unchanged and preserves model capacity. This substantially reduces fertility (the average number of tokens per word) for both Hungarian and Thai, lowering training and inference costs.
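The following is a minimal sketch of that replacement step, assuming we already have the base vocabulary with token frequencies counted on the original corpus and candidate target-language tokens ranked by frequency. All names are illustrative placeholders, and the embedding initialization shown (mean of the new token's old sub-token embeddings) is a common heuristic rather than necessarily the paper's exact scheme.

```python
# Sketch of the vocabulary-replacement step: swap the least-frequent base
# tokens for new target-language tokens, keeping the vocabulary and the
# embedding matrix the same size. All names are illustrative placeholders.
import torch

def replace_rare_tokens(
    vocab: dict[str, int],            # token string -> token id
    embeddings: torch.Tensor,         # (vocab_size, hidden_dim) input embeddings
    base_token_freq: dict[str, int],  # base-token frequencies on the original corpus
    new_lang_tokens: list[str],       # target-language tokens, most frequent first
    old_tokenize,                     # callable: str -> list[int], base tokenizer ids
    num_to_replace: int,              # e.g. ~10% of the vocabulary
):
    # Only add tokens that are not already in the base vocabulary.
    candidates = [t for t in new_lang_tokens if t not in vocab]

    # Least-frequent tokens in the base vocabulary are the ones to give up.
    rare = sorted(vocab, key=lambda t: base_token_freq.get(t, 0))[:num_to_replace]

    original = embeddings.clone()  # read old rows even after slots are overwritten
    for old_tok, new_tok in zip(rare, candidates):
        tok_id = vocab.pop(old_tok)   # free the slot
        vocab[new_tok] = tok_id       # reuse its id for the new token
        # Initialize the new row as the mean of the old sub-token embeddings
        # (a common heuristic; the paper's exact initialization may differ).
        sub_ids = old_tokenize(new_tok)
        embeddings[tok_id] = original[sub_ids].mean(dim=0)
    return vocab, embeddings
```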
Mitigating Catastrophic Forgetting
To prevent catastrophic forgetting, the authors mix data strategically during both continual pretraining and instruction tuning: training batches draw from both the original language (English) and the new language, which improves performance in the target language while retaining the model's English capabilities. A minimal sketch of this mixing is shown after this paragraph.
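The sketch below illustrates the mixing idea with the Hugging Face datasets library; the 1:3 English-to-Hungarian ratio and the tiny in-memory corpora are illustrative assumptions, not the paper's actual data or proportions.

```python
# Minimal sketch of data mixing for continual pretraining: interleave English
# and target-language text at a fixed sampling ratio. The ratio below is an
# illustrative choice, not necessarily the one used in the paper.
from datasets import Dataset, interleave_datasets

english = Dataset.from_dict({"text": ["English document 1", "English document 2"]})
hungarian = Dataset.from_dict({"text": ["Magyar dokumentum 1", "Magyar dokumentum 2"]})

mixed = interleave_datasets(
    [english, hungarian],
    probabilities=[0.25, 0.75],   # keep some English to avoid catastrophic forgetting
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixed:
    print(example["text"])
```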
Experimental Results
Adapting English models to Hungarian and Thai yielded better results than existing open-source models for these languages on tasks including multiple-choice and open-ended question answering, summarization, and translation. The experiments show that replacing approximately 10% of the tokenizer's tokens produces a substantial fertility reduction (42% for Hungarian and 73% for Thai) without hurting English performance. Likewise, mixing training data from both languages maintains English proficiency while improving results in the target languages.
Implications and Future Directions
The approach has practical implications for improving LLM performance in underrepresented languages using existing pretrained models. Beyond the computational savings, it provides a template for applying the same recipe to other low-resource languages. Future research could scale the approach by testing additional languages and examining its impact on other types of language tasks.
Conclusion
In summary, the paper provides an effective framework for adapting pretrained LLMs to new languages by optimizing tokenizer efficiency and preventing catastrophic forgetting through strategic data mixing. The results indicate that these methods can significantly improve performance in low-resource languages without degrading the original English capabilities, a step towards more inclusive multilingual LLMs.