
Improving Text Embeddings with Large Language Models

(2401.00368)
Published Dec 31, 2023 in cs.CL and cs.IR

Abstract

In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.

Figure: task type and language statistics of the generated synthetic data, as reported in the paper.

Overview

  • The paper presents a new method for creating text embeddings using LLMs without needing labeled data.

  • The method first prompts proprietary LLMs to brainstorm candidate embedding tasks and then generate synthetic examples for them, and fine-tunes open-source decoder-only LLMs such as Mistral on the resulting data (see the sketch after this list).

  • Experiments show that a model fine-tuned from Mistral-7B performs strongly on benchmarks such as BEIR and MTEB using synthetic data alone.

  • When fine-tuned on a mixture of synthetic and labeled data, the model sets new state-of-the-art results on these benchmarks with fewer than 1k training steps.

  • The paper suggests room for improvement in multilingual capabilities and reducing reliance on proprietary LLMs.
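
To make the two-step data generation concrete, here is a minimal Python sketch of the brainstorm-then-generate loop. The `chat` helper, the prompt wording, and the JSON fields (`user_query`, `positive_document`, `hard_negative_document`) are illustrative assumptions standing in for the paper's actual prompt templates and the proprietary LLM API it calls.

```python
import json
import random

def chat(prompt: str) -> str:
    """Stand-in for a call to a proprietary LLM API (not the paper's released code).
    Returns canned output here so the sketch runs end to end."""
    if "Brainstorm" in prompt:
        return json.dumps([
            "Given a question about home repair, retrieve forum posts that answer it.",
            "Given a scientific claim, retrieve papers that support or refute it.",
        ])
    return json.dumps({
        "user_query": "How do I fix a leaking kitchen faucet?",
        "positive_document": "Step-by-step guide to replacing a worn faucet cartridge ...",
        "hard_negative_document": "Overview of kitchen faucet styles and finishes ...",
    })

# Step 1: brainstorm a pool of candidate text embedding tasks.
brainstorm_prompt = (
    "Brainstorm a list of potentially useful text retrieval tasks. "
    "Describe each task in one sentence and output a JSON list of strings."
)
tasks = json.loads(chat(brainstorm_prompt))

# Step 2: condition on one sampled task and generate a (query, positive, hard negative) triple.
task = random.choice(tasks)
generate_prompt = (
    f"You have been assigned a retrieval task: {task}\n"
    "Write one JSON object with keys 'user_query', 'positive_document' and "
    "'hard_negative_document'. The hard negative should look relevant but not answer the query."
)
triple = json.loads(chat(generate_prompt))
print(triple["user_query"])
```

In the actual pipeline, the prompts are additionally varied along attributes such as language, task type, and query length to broaden the diversity of the synthetic data.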

Introduction

Text embeddings are compact vector representations that capture the semantic content of text, making them useful across a variety of natural language processing tasks. These tasks include information retrieval, machine translation, and semantic textual similarity, where accuracy depends heavily on the quality of the embeddings. Traditional methods for learning text embeddings typically involve complex pipelines with multi-stage training on large volumes of weakly labeled text pairs, followed by fine-tuning on smaller, higher-quality labeled datasets.
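
As a purely illustrative sketch of how embeddings support retrieval, the snippet below ranks a handful of documents against a query by cosine similarity; the random vectors stand in for embeddings produced by any real model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random vectors stand in for real query/document embeddings (dimension chosen arbitrarily).
rng = np.random.default_rng(0)
query_emb = rng.normal(size=768)
doc_embs = rng.normal(size=(5, 768))

# Rank documents by similarity to the query; higher score = more relevant.
scores = [cosine_similarity(query_emb, d) for d in doc_embs]
ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print("Documents ranked by relevance:", ranking)
```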

Novel Approach to Text Embeddings

In contrast to these multi-stage pipelines, this paper introduces a streamlined method that leverages LLMs to produce text embeddings with competitive performance across numerous tasks and languages, without requiring labeled training data. The approach generates synthetic data by first brainstorming candidate tasks with an LLM and then generating examples for those tasks, covering a wide range of languages and task types. Decoder-only LLMs such as Mistral are then fine-tuned on this synthetic data with a standard contrastive loss, yielding robust results.
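
The fine-tuning recipe can be sketched as follows: embeddings are taken from the last-token hidden state of the decoder-only model and trained with an InfoNCE-style contrastive loss over in-batch negatives. This is a minimal sketch assuming PyTorch and Hugging Face `transformers`; the checkpoint name, temperature, and pooling details are simplifications of the paper's setup (which also prepends task instructions to queries, appends an [EOS] token before pooling, and uses LoRA for efficient fine-tuning).

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper fine-tunes an open-source Mistral-7B model.
name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token   # Mistral's tokenizer has no pad token by default
tokenizer.padding_side = "right"            # so the last non-padding token is easy to locate
model = AutoModel.from_pretrained(name)

def embed(texts: list[str]) -> torch.Tensor:
    """Encode texts and use the hidden state of the final non-padding token as the embedding.
    (The paper appends an [EOS] token to each text before pooling; omitted here for brevity.)"""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)
    last = batch["attention_mask"].sum(dim=1) - 1      # index of last real token per sequence
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)

def info_nce_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.02) -> torch.Tensor:
    """Standard InfoNCE contrastive loss with in-batch negatives:
    the i-th query should score highest against the i-th (positive) document."""
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# One hypothetical batch of synthetic data; an optimizer step would follow in real training.
queries = ["Instruct: retrieve documents that answer the question\nQuery: how to fix a leaking faucet"]
documents = ["Replace the worn cartridge inside the faucet handle ..."]
loss = info_nce_loss(embed(queries), embed(documents))
print(loss.item())
```

In the paper's setup, only the query side receives the instruction prefix; documents are encoded as-is, so the same document embedding can serve multiple tasks.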

Experiments and Findings

Experiments show that the fine-tuned Mistral-7B model achieves results competitive with the state of the art on benchmarks such as BEIR and MTEB using only synthetic data. When a mixture of synthetic and labeled data is used, performance improves further, setting new state-of-the-art results on these benchmarks with fewer than 1k training steps. The model also shows promise for handling extended context lengths and multilingual representation, although the results highlight the need for more diverse pre-training to serve low-resource languages.

Conclusion and Future Work

This paper underscores the potential of LLM-generated synthetic data to significantly enhance text embeddings while simplifying and shortening the training process. High-resource languages benefit most from the approach; future research could strengthen the model's multilingual capabilities and efficiency, and potentially remove the reliance on proprietary LLMs for synthetic data generation.
