Emergent Mind

SambaLingo: Teaching Large Language Models New Languages

(arXiv:2404.05829)
Published Apr 8, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

Figure: Comparison of perplexity between proprietary and open-source multilingual models, including the Japanese Swallow-7b-hf.

Overview

  • The paper presents an in-depth study on adapting LLMs to new languages, focusing on vocabulary extension, continual pre-training, and human preference alignment.

  • Experiments adapting LLMs to nine typologically diverse languages set new state-of-the-art results, surpassing previous models in both tokenizer efficiency and task accuracy.

  • A novel approach for model alignment with human preferences using minimal data is introduced, illustrating the effectiveness of translated alignment data in low-resource languages.

  • Empirical results underscore the method's potential to make LLMs more accessible and efficient across languages, and the paper proposes future directions for language-specific tuning and broader language inclusion.

Comprehensive Study on Adapting LLMs to New Languages

Introduction to Language Model Adaptation

The adaptation of pre-trained LLMs to new languages has emerged as a promising avenue for leveraging existing computational and data resources to extend the utility of these models across diverse linguistic landscapes. This paper presents an extensive study of strategies for adapting LLMs to nine typologically diverse languages: Arabic, Bulgarian, Hungarian, Japanese, Russian, Serbian, Slovenian, Thai, and Turkish. The research explores vocabulary extension, continual pre-training, and methods for aligning models with human preferences in low-resource languages. Through meticulous experimentation, the study establishes new state-of-the-art results in these languages across several dimensions.

Key Findings on Language Model Adaptation

Vocabulary Expansion and Model Initialization

The study highlighted the significance of expanding the model's vocabulary with tokens from the target language. Although this did not substantially improve downstream task accuracy, it improved tokenizer efficiency (fewer tokens per sentence) and therefore inference throughput in the target languages. Among the strategies explored for initializing the new token embeddings, averaging the embeddings of each new token's constituent sub-words accelerated convergence during training with minimal impact on final accuracy.
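The sub-word averaging strategy can be sketched as follows. This is an illustrative NumPy sketch rather than the paper's actual implementation: `old_tokenize` and the plain embedding matrix stand in for a real tokenizer and model embedding table.

```python
import numpy as np

def init_new_embeddings(old_embeddings, old_tokenize, new_tokens):
    """Initialize each new token's embedding as the mean of the embeddings
    of the sub-words it decomposes into under the original tokenizer.
    Falls back to the mean of the whole embedding matrix if a token
    produces no sub-word ids."""
    new_rows = []
    for token in new_tokens:
        sub_ids = old_tokenize(token)  # ids in the original vocabulary
        if sub_ids:
            new_rows.append(old_embeddings[sub_ids].mean(axis=0))
        else:
            new_rows.append(old_embeddings.mean(axis=0))
    # extended embedding matrix: original rows followed by new-token rows
    return np.vstack([old_embeddings, np.array(new_rows)])
```

In a real setup the same idea is applied to the model's input (and, for untied weights, output) embedding matrices after resizing them to the extended vocabulary.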

Continual Pre-training with Mixed Language Data

The effectiveness of continual pre-training was demonstrated through training on a mixture of English and target-language web data. The research indicates that a higher proportion of target-language data yields faster convergence and better performance in the target language, underscoring the importance of a balanced, thoughtfully curated training corpus.

Human Preference Alignment with Limited Data

An innovative aspect of this study is its approach to aligning models with human preferences using a minimal amount of alignment data. The findings suggest that a judicious mixture of translated alignment data can be nearly as effective as exclusively using data written in the target language for model alignment, thus mitigating the challenge of data scarcity in low-resource languages.
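The alignment step named in the abstract is direct preference optimization (DPO). Whatever the mix of translated and native preference pairs, each pair contributes the standard per-example DPO loss, sketched here from the original DPO formulation; the argument names and `beta` value are illustrative, not taken from the paper.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss. Arguments are summed log-probabilities of the
    chosen/rejected responses under the trainable policy (pi_*) and the
    frozen reference model (ref_*). The loss is -log sigmoid of the
    beta-scaled difference of policy-vs-reference log-ratio margins."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference (margin 0), the loss is log 2; it shrinks as the policy assigns relatively more probability to the chosen response, which is exactly the pressure that translated preference pairs can still supply in a low-resource language.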

Quantitative Benchmarks and Evaluation

The adapted models were benchmarked against a suite of established multilingual and language-specific tests, showing superior performance over previous state-of-the-art models. Through rigorous evaluation, the adapted models demonstrated improvements in perplexity, translation quality, text classification, and natural language understanding tasks across all target languages. These results validate the effectiveness of the proposed adaptation methodology and underscore its potential as a scalable solution for enhancing the accessibility and utility of LLMs across a wider array of languages.
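One of the reported metrics, perplexity, is straightforward to compute from a model's per-token log-probabilities; the sketch below shows the standard definition (exponential of the negative mean log-likelihood), independent of any particular model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a sequence given per-token natural-log probabilities:
    exp of the negative average log-likelihood. Lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

For example, a model that assigns probability 0.5 to every token has perplexity 2, i.e. it is as uncertain as a fair coin flip at each step.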

Future Directions in LLM Adaptation

This comprehensive study not only advances our understanding of the processes involved in adapting LLMs to new languages but also sets the stage for future research in this area. The open sourcing of code and checkpoints is likely to stimulate further developments, enabling researchers to build upon the solid foundation laid by this work. Future work may delve deeper into the nuances of language-specific model tuning, extend coverage to additional languages, including those with non-Latin scripts and unique linguistic features, and refine human preference alignment techniques to accommodate diverse cultural and regional nuances.

Conclusion

In conclusion, this paper contributes significantly to the field of computational linguistics by providing a detailed protocol for adapting LLMs to new languages, supported by empirical evidence of its efficacy across a wide range of linguistic tasks. By addressing key challenges such as vocabulary extension, training data scarcity, and alignment with human preferences, this work paves the way for more accessible, efficient, and versatile language models, democratizing the benefits of AI across linguistic boundaries.
