Abstract

Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages. This problem is commonly tackled by continually pre-training and fine-tuning these models for said languages. A significant issue in this process is the limited vocabulary coverage in the original model's tokenizer, leading to inadequate representation of new languages and necessitating an expansion of the tokenizer. The initialization of the embeddings corresponding to new vocabulary items presents a further challenge. Current strategies require cross-lingual embeddings and lack a solid theoretical foundation as well as comparisons with strong baselines. In this paper, we first establish theoretically that initializing within the convex hull of existing embeddings is a good initialization, and then propose a novel yet simple approach, Constrained Word2Vec (CW2V), which does not require cross-lingual embeddings. Our study evaluates different initialization methods for expanding RoBERTa and LLaMA 2 across four languages and five tasks. The results show that CW2V performs on par with or even better than more advanced techniques. Additionally, simpler approaches like multivariate initialization perform on par with these advanced methods, indicating that efficient large-scale multilingual continued pretraining can be achieved even with simpler initialization methods.

Figure: Performance of various initialization methods on expanded RoBERTa models across multilingual tasks and languages.

Overview

  • The paper explores methods to adapt pre-trained language models like RoBERTa and LLaMA2 for multiple languages by expanding their vocabulary and initializing new embeddings effectively.

  • It introduces Constrained Word2Vec (CW2V), a novel initialization method that avoids the need for cross-lingual embeddings, and compares it against five other strategies across various models, languages, and tasks.

  • Results show that CW2V often outperforms more complex methods in multilingual and generative tasks for LLaMA2, highlighting the efficacy of simpler, theoretically sound techniques for embedding initialization.

Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models

The paper "An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models" explores how to adapt pre-trained language models (LMs) like RoBERTa and LLaMA2 to accommodate multiple languages by expanding their vocabulary and initializing the new embeddings effectively. This study addresses two primary challenges: enhancing the tokenizer’s vocabulary to better capture different languages and effectively initializing the embeddings for these new vocabulary items.

Problem Context and Objectives

Language models, although highly effective for English, often underperform in other languages due to limited vocabulary coverage in the tokenizer. The authors tackle this issue by introducing new vocabulary items and investigating various initialization strategies for the corresponding embeddings. Prevalent initialization techniques rely on cross-lingual embeddings, yet they lack a strong theoretical basis and have not been compared against simpler baselines. This paper aims to bridge this gap by providing a theoretical foundation for what constitutes a good initialization and by introducing Constrained Word2Vec (CW2V), a novel method that forgoes the need for cross-lingual embeddings.

Methodological Approach

The authors establish theoretically that embeddings lying within the convex hull of the existing embeddings provide a good initialization. They introduce CW2V, which enforces this constraint without needing cross-lingual data: each new embedding is learned as a combination of the source embeddings through a weight matrix $W$ whose non-negative rows sum to one, so every new embedding is a convex combination of the source embeddings and therefore lies within their convex hull. The study compares CW2V against five initialization strategies, namely OFA, Multivariate, Univariate, Mean, and Random initialization, across two models (RoBERTa and LLaMA2), four languages (Hindi, Tamil, Russian, and German), and five tasks (XNLI, NER, QA, MT, XLSUM).
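As a rough illustration of how such a constraint can be enforced (a sketch under stated assumptions, not the authors' exact implementation), each new embedding can be parameterized as a softmax-weighted combination of the frozen source embeddings, which by construction keeps it inside their convex hull. The class name, toy dimensions, and the omitted training objective below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexHullEmbeddings(nn.Module):
    """Sketch: new embeddings as convex combinations of frozen source embeddings.

    Softmax over the logits yields non-negative weights that sum to 1, so every
    generated embedding lies inside the convex hull of `source_emb` by construction.
    """

    def __init__(self, source_emb: torch.Tensor, num_new_tokens: int):
        super().__init__()
        self.register_buffer("source_emb", source_emb.detach())   # (V_src, d), frozen
        self.logits = nn.Parameter(                                # (V_new, V_src), learnable
            torch.zeros(num_new_tokens, source_emb.size(0))
        )

    def forward(self) -> torch.Tensor:
        weights = F.softmax(self.logits, dim=-1)                   # rows = convex weights
        return weights @ self.source_emb                           # (V_new, d)

# Toy usage (real sizes would be e.g. 32000 x 4096 for LLaMA2 embeddings).
source_emb = torch.randn(1000, 64)
module = ConvexHullEmbeddings(source_emb, num_new_tokens=200)
new_emb = module()   # (200, 64); in CW2V the logits would be trained with a
                     # Word2Vec-style objective on target-language text (omitted here).
```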

Experimental Setup

For tokenizer expansion, a unified target tokenizer was created with SentencePiece by merging subwords from the target languages with those of the LLaMA2 tokenizer. This substantially reduces fertility (the average number of subword tokens per word) for the target languages and thus yields better representations. CW2V and the baseline initializations were evaluated both before and after continual pre-training (CPT) on the various downstream tasks, using task-appropriate metrics.
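To make the fertility metric concrete, here is a minimal sketch of how fertility can be computed for a SentencePiece tokenizer; the model-file paths and sample sentence in the usage comment are placeholders, not artifacts from the paper.

```python
import sentencepiece as spm

def fertility(sp: spm.SentencePieceProcessor, sentences: list[str]) -> float:
    """Average number of subword tokens produced per whitespace-separated word."""
    total_tokens, total_words = 0, 0
    for sent in sentences:
        total_tokens += len(sp.encode(sent, out_type=str))
        total_words += len(sent.split())
    return total_tokens / max(total_words, 1)

# Hypothetical usage (model-file paths are placeholders):
#   base = spm.SentencePieceProcessor(model_file="llama2_tokenizer.model")
#   expanded = spm.SentencePieceProcessor(model_file="expanded_tokenizer.model")
#   sample = ["यह एक उदाहरण वाक्य है।"]
#   print(fertility(base, sample), fertility(expanded, sample))
```

A lower fertility means the expanded tokenizer splits target-language words into fewer pieces, which is the "better representation" the setup aims for.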

Results

Initial findings indicate that CW2V preserves the pre-expansion performance for English tasks better than other methods for both RoBERTa and LLaMA2. Specifically, for RoBERTa, CW2V is on par with or slightly inferior to OFA in multilingual tasks, but significantly better in preserving English performance. For LLaMA2, CW2V outperforms OFA across most multilingual and generative tasks while showing comparable performance on English tasks.

With CPT, both RoBERTa and LLaMA2 models initialized via CW2V converge swiftly, matching and sometimes surpassing the performance of more sophisticated methods like OFA. Interestingly, simpler methods such as multivariate initialization, whose sampled embeddings lie within the convex hull with high probability, also demonstrate competitive performance, suggesting that a strong initialization strategy does not always require complex methodologies.
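For context on the multivariate baseline, a minimal sketch follows, assuming the common recipe of sampling new embeddings from a Gaussian fitted to the mean and covariance of the existing embedding matrix; the function name and toy sizes are illustrative, not taken from the paper.

```python
import numpy as np

def multivariate_init(source_emb: np.ndarray, num_new_tokens: int, seed: int = 0) -> np.ndarray:
    """Sample new embeddings from a Gaussian fitted to the existing embedding matrix."""
    rng = np.random.default_rng(seed)
    mean = source_emb.mean(axis=0)            # (d,)
    cov = np.cov(source_emb, rowvar=False)    # (d, d)
    return rng.multivariate_normal(mean, cov, size=num_new_tokens)

# Toy usage: 200 new embeddings drawn to match the statistics of 1000 existing ones.
existing = np.random.randn(1000, 64)
new_rows = multivariate_init(existing, num_new_tokens=200)   # shape (200, 64)
```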

Implications

Theoretical findings highlighting the importance of convex hull properties in embedding initialization underscore the potential for simpler, computationally cheaper methods to be equally effective for multilingual adaptation. Practically, employing methods like CW2V can significantly improve adaptation performance for different languages in a pre-trained LM without extensive cross-lingual resources. However, initial phases of CPT tend to adversely affect performance on English tasks, necessitating prolonged training to mitigate this effect, hinting at a delicate balance between multilingual adaptation and the preservation of the source language capabilities.

Future Directions

This study lays a groundwork for further research into more efficient and theoretically grounded initialization strategies. Future work can explore broader sets of languages and tasks, finer adjustments to CPT to better balance source and target language performance, and the potential integration of these findings with dynamic vocabulary adaptation methods.

By addressing the inherent challenges in vocabulary expansion and providing empirical evidence for the efficacy of simple, yet theoretically sound methods, this paper contributes significantly to the ongoing development of robust, multilingual language models.
