Emergent Mind

SambaLingo: Teaching Large Language Models New Languages

(arXiv:2404.05829)
Published Apr 8, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

Figure: Comparison of perplexity between proprietary and open-source multilingual models, including the Japanese Swallow-7b-hf.

Overview

  • The paper presents an in-depth study on adapting LLMs to new languages, focusing on vocabulary extension, continual pre-training, and human preference alignment.

  • Experiments adapting LLMs to nine typologically diverse languages set new state-of-the-art results, surpassing previous models in both tokenizer efficiency and task accuracy.

  • A novel approach for model alignment with human preferences using minimal data is introduced, illustrating the effectiveness of translated alignment data in low-resource languages.

  • Empirical results underscore the method's potential to make LLMs more accessible and efficient across languages, and the paper proposes future directions for language-specific tuning and broader language inclusion.

Comprehensive Study on Adapting LLMs to New Languages

Introduction to Language Model Adaptation

The adaptation of pre-trained LLMs to new languages has emerged as a promising avenue for leveraging existing computational and data resources to extend the utility of these models across diverse linguistic landscapes. This paper presents an extensive study of strategies for adapting LLMs to nine typologically diverse languages: Arabic, Bulgarian, Hungarian, Japanese, Russian, Serbian, Slovenian, Thai, and Turkish. The research explores vocabulary extension, continual pre-training, and methods for aligning models with human preferences in low-resource languages. Through meticulous experimentation, the study establishes new state-of-the-art results in these languages across several dimensions.

Key Findings on Language Model Adaptation

Vocabulary Expansion and Model Initialization

The study highlighted the significance of expanding the model's vocabulary with tokens from the target language. Although this did not substantially improve downstream task accuracy, it improved tokenizer efficiency (fewer tokens per sentence) and therefore inference throughput in the target languages. Among the strategies explored for initializing the new token embeddings, averaging the embeddings of each new token's constituent sub-words accelerated convergence during training with minimal impact on final accuracy.
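The sub-word averaging strategy can be sketched as follows. This is an illustrative NumPy sketch rather than the paper's actual implementation: `old_tokenize` and the plain embedding matrix stand in for a real tokenizer and model embedding table.

```python
import numpy as np

def init_new_embeddings(old_embeddings, old_tokenize, new_tokens):
    """Initialize each new token's embedding as the mean of the embeddings
    of the sub-words it decomposes into under the original tokenizer.
    Falls back to the mean of the whole embedding matrix if a token
    produces no sub-word ids."""
    new_rows = []
    for token in new_tokens:
        sub_ids = old_tokenize(token)  # ids in the original vocabulary
        if sub_ids:
            new_rows.append(old_embeddings[sub_ids].mean(axis=0))
        else:
            new_rows.append(old_embeddings.mean(axis=0))
    # extended embedding matrix: original rows followed by new-token rows
    return np.vstack([old_embeddings, np.array(new_rows)])
```

In a real setup the same idea is applied to the model's input (and, for untied weights, output) embedding matrices after resizing them to the extended vocabulary.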

Continual Pre-training with Mixed Language Data

The effectiveness of continual pre-training was demonstrated through training on a mixture of English and target-language web data. The research indicates that a higher proportion of target-language data yields faster convergence and better performance in the target language, underscoring the importance of a balanced, thoughtfully curated training corpus.

Human Preference Alignment with Limited Data

An innovative aspect of this study is its approach to aligning models with human preferences using a minimal amount of alignment data. The findings suggest that a judicious mixture of translated alignment data can be nearly as effective as exclusively using data written in the target language for model alignment, thus mitigating the challenge of data scarcity in low-resource languages.
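The alignment step named in the abstract is direct preference optimization (DPO). Whatever the mix of translated and native preference pairs, each pair contributes the standard per-example DPO loss, sketched here from the original DPO formulation; the argument names and `beta` value are illustrative, not taken from the paper.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss. Arguments are summed log-probabilities of the
    chosen/rejected responses under the trainable policy (pi_*) and the
    frozen reference model (ref_*). The loss is -log sigmoid of the
    beta-scaled difference of policy-vs-reference log-ratio margins."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference (margin 0), the loss is log 2; it shrinks as the policy assigns relatively more probability to the chosen response, which is exactly the pressure that translated preference pairs can still supply in a low-resource language.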

Quantitative Benchmarks and Evaluation

The adapted models were benchmarked against a suite of established multilingual and language-specific tests, showing superior performance over previous state-of-the-art models. Through rigorous evaluation, the adapted models demonstrated improvements in perplexity, translation quality, text classification, and natural language understanding tasks across all target languages. These results validate the effectiveness of the proposed adaptation methodology and underscore its potential as a scalable solution for enhancing the accessibility and utility of LLMs across a wider array of languages.
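One of the reported metrics, perplexity, is straightforward to compute from a model's per-token log-probabilities; the sketch below shows the standard definition (exponential of the negative mean log-likelihood), independent of any particular model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a sequence given per-token natural-log probabilities:
    exp of the negative average log-likelihood. Lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

For example, a model that assigns probability 0.5 to every token has perplexity 2, i.e. it is as uncertain as a fair coin flip at each step.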

Future Directions in LLM Adaptation

This comprehensive study not only advances our understanding of the processes involved in adapting LLMs to new languages but also sets the stage for future research in this area. The open sourcing of code and checkpoints is likely to stimulate further developments, enabling researchers to build upon the solid foundation laid by this work. Future work may delve deeper into the nuances of language-specific model tuning, extend coverage to additional languages, including those with non-Latin scripts and unique linguistic features, and refine human preference alignment techniques to accommodate diverse cultural and regional nuances.

Conclusion

In conclusion, this paper contributes significantly to the field of computational linguistics by providing a detailed protocol for adapting LLMs to new languages, supported by empirical evidence of its efficacy across a wide range of linguistic tasks. By addressing key challenges such as vocabulary extension, training data scarcity, and alignment with human preferences, this work paves the way for more accessible, efficient, and versatile language models, democratizing the benefits of AI across linguistic boundaries.
