
Improving Text Embeddings with Large Language Models

(2401.00368)
Published Dec 31, 2023 in cs.CL and cs.IR

Abstract

In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.

Figure: task type and language statistics of the generated synthetic data, as reported in the paper.

Overview

  • The paper presents a new method for creating text embeddings using LLMs without needing labeled data.

  • The method first prompts proprietary LLMs to brainstorm candidate embedding tasks and then generate synthetic examples for them, and fine-tunes open-source decoder-only LLMs such as Mistral on the resulting data (see the sketch after this list).

  • Experiments show that a model fine-tuned from Mistral-7B performs strongly on benchmarks such as BEIR and MTEB using synthetic data alone.

  • When fine-tuned on a mixture of synthetic and labeled data, the model sets new state-of-the-art results on these benchmarks with fewer than 1k training steps.

  • The paper suggests room for improvement in multilingual capabilities and reducing reliance on proprietary LLMs.
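
To make the two-step data generation concrete, here is a minimal Python sketch of the brainstorm-then-generate loop. The `chat` helper, the prompt wording, and the JSON fields (`user_query`, `positive_document`, `hard_negative_document`) are illustrative assumptions standing in for the paper's actual prompt templates and the proprietary LLM API it calls.

```python
import json
import random

def chat(prompt: str) -> str:
    """Stand-in for a call to a proprietary LLM API (not the paper's released code).
    Returns canned output here so the sketch runs end to end."""
    if "Brainstorm" in prompt:
        return json.dumps([
            "Given a question about home repair, retrieve forum posts that answer it.",
            "Given a scientific claim, retrieve papers that support or refute it.",
        ])
    return json.dumps({
        "user_query": "How do I fix a leaking kitchen faucet?",
        "positive_document": "Step-by-step guide to replacing a worn faucet cartridge ...",
        "hard_negative_document": "Overview of kitchen faucet styles and finishes ...",
    })

# Step 1: brainstorm a pool of candidate text embedding tasks.
brainstorm_prompt = (
    "Brainstorm a list of potentially useful text retrieval tasks. "
    "Describe each task in one sentence and output a JSON list of strings."
)
tasks = json.loads(chat(brainstorm_prompt))

# Step 2: condition on one sampled task and generate a (query, positive, hard negative) triple.
task = random.choice(tasks)
generate_prompt = (
    f"You have been assigned a retrieval task: {task}\n"
    "Write one JSON object with keys 'user_query', 'positive_document' and "
    "'hard_negative_document'. The hard negative should look relevant but not answer the query."
)
triple = json.loads(chat(generate_prompt))
print(triple["user_query"])
```

In the actual pipeline, the prompts are additionally varied along attributes such as language, task type, and query length to broaden the diversity of the synthetic data.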

Introduction

Text embeddings are compact vector representations that capture the semantic content of text, making them useful across a variety of natural language processing tasks. These tasks include information retrieval, machine translation, and semantic textual similarity, where accuracy depends heavily on the quality of the embeddings. Traditional methods for learning text embeddings typically involve complex pipelines with multi-stage training on large volumes of weakly labeled text pairs, followed by fine-tuning on smaller, higher-quality labeled datasets.
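
As a purely illustrative sketch of how embeddings support retrieval, the snippet below ranks a handful of documents against a query by cosine similarity; the random vectors stand in for embeddings produced by any real model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random vectors stand in for real query/document embeddings (dimension chosen arbitrarily).
rng = np.random.default_rng(0)
query_emb = rng.normal(size=768)
doc_embs = rng.normal(size=(5, 768))

# Rank documents by similarity to the query; higher score = more relevant.
scores = [cosine_similarity(query_emb, d) for d in doc_embs]
ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print("Documents ranked by relevance:", ranking)
```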

Novel Approach to Text Embeddings

In contrast to these multi-stage pipelines, this paper introduces a streamlined method that leverages LLMs to produce text embeddings with competitive performance across numerous tasks and languages, without requiring labeled training data. The approach generates synthetic data by first brainstorming candidate tasks with an LLM and then generating examples for those tasks, covering a wide range of languages and task types. Decoder-only LLMs such as Mistral are then fine-tuned on this synthetic data with a standard contrastive loss, yielding robust results.
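
The fine-tuning recipe can be sketched as follows: embeddings are taken from the last-token hidden state of the decoder-only model and trained with an InfoNCE-style contrastive loss over in-batch negatives. This is a minimal sketch assuming PyTorch and Hugging Face `transformers`; the checkpoint name, temperature, and pooling details are simplifications of the paper's setup (which also prepends task instructions to queries, appends an [EOS] token before pooling, and uses LoRA for efficient fine-tuning).

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper fine-tunes an open-source Mistral-7B model.
name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token   # Mistral's tokenizer has no pad token by default
tokenizer.padding_side = "right"            # so the last non-padding token is easy to locate
model = AutoModel.from_pretrained(name)

def embed(texts: list[str]) -> torch.Tensor:
    """Encode texts and use the hidden state of the final non-padding token as the embedding.
    (The paper appends an [EOS] token to each text before pooling; omitted here for brevity.)"""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)
    last = batch["attention_mask"].sum(dim=1) - 1      # index of last real token per sequence
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)

def info_nce_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.02) -> torch.Tensor:
    """Standard InfoNCE contrastive loss with in-batch negatives:
    the i-th query should score highest against the i-th (positive) document."""
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# One hypothetical batch of synthetic data; an optimizer step would follow in real training.
queries = ["Instruct: retrieve documents that answer the question\nQuery: how to fix a leaking faucet"]
documents = ["Replace the worn cartridge inside the faucet handle ..."]
loss = info_nce_loss(embed(queries), embed(documents))
print(loss.item())
```

In the paper's setup, only the query side receives the instruction prefix; documents are encoded as-is, so the same document embedding can serve multiple tasks.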

Experiments and Findings

Experiments show that the fine-tuned Mistral-7B model achieves results competitive with the state of the art on benchmarks such as BEIR and MTEB using only synthetic data. When a mixture of synthetic and labeled data is used, performance improves further, setting new state-of-the-art results on these benchmarks with fewer than 1k training steps. The model also shows promise for handling extended context lengths and multilingual representation, although the results highlight the need for more diverse pre-training to serve low-resource languages.

Conclusion and Future Work

This paper underscores the potential of LLM-generated synthetic data to significantly enhance text embeddings while simplifying and shortening the training process. High-resource languages benefit most from the approach; future research could strengthen the model's multilingual capabilities and efficiency, and potentially remove the reliance on proprietary LLMs for synthetic data generation.
