Text Embeddings by Weakly-Supervised Contrastive Pre-training

(2212.03533)
Published Dec 7, 2022 in cs.CL and cs.IR

Abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.

Overview

  • A new method for creating high-quality text embeddings utilizes contrastive pre-training with weak supervision.

  • The dataset CCPairs is introduced, comprising a large-scale collection of text pairs for contrastive learning.

  • The E5 model, trained on CCPairs, is the first to outperform the BM25 baseline on the BEIR zero-shot retrieval benchmark without using any labeled data.

  • When fine-tuned on a blend of labeled datasets that inject human knowledge, E5 achieves the best results on the MTEB benchmark, outperforming embedding models with far more parameters.

  • E5 proves versatile across retrieval, classification, and other single-vector embedding tasks, though how far purely self-supervised pre-training can take text embeddings remains an open question.

Introduction

The development of text embeddings has been instrumental in advancing natural language processing. These embeddings represent text as low-dimensional vectors, enabling efficient matching between texts and supporting retrieval, clustering, and classification. Although pre-trained language models such as BERT and GPT produce transferable text representations, their off-the-shelf outputs are suboptimal when used directly as single-vector embeddings. The paper introduces an approach for producing high-quality text embeddings through contrastive pre-training with weak supervision.
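To make the idea of a single-vector representation concrete, the sketch below mean-pools a pre-trained encoder's token outputs into one fixed-size vector per text. It is a minimal illustration only, assuming the Hugging Face transformers library and the generic bert-base-uncased checkpoint as a stand-in encoder rather than the paper's own model.

```python
# Minimal sketch: turning a pre-trained encoder's token outputs into a single
# fixed-size text embedding via attention-masked mean pooling.
# Assumes the `transformers` and `torch` packages; bert-base-uncased is only a
# stand-in encoder, not the paper's E5 model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["how do text embeddings work?", "contrastive pre-training for retrieval"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_states = model(**batch).last_hidden_state          # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1).float()          # zero out padding tokens
embeddings = (token_states * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)  # unit-length vectors
print(embeddings.shape)  # torch.Size([2, 768])
```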

Data Curation and Methodology

The cornerstone of this approach is the CCPairs dataset, a large-scale collection of text pairs extracted from semi-structured web sources and filtered for quality with a consistency-based approach. This dataset enables contrastive learning, in which the model learns to distinguish each relevant text pair from the many irrelevant pairings within a large batch of examples. Leveraging weak supervision from heterogeneous sources such as CommunityQA, Common Crawl, and scientific papers, the E5 model is trained contrastively with in-batch negatives.
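Concretely, this style of training can be sketched as a temperature-scaled contrastive (InfoNCE-style) loss in which every other passage in the batch serves as a negative for a given query. The snippet below is a minimal PyTorch sketch of that objective, assuming pre-computed query and passage embeddings and an illustrative temperature value rather than the paper's exact hyperparameters.

```python
# Sketch of contrastive training with in-batch negatives (InfoNCE-style loss).
# q_emb[i] and p_emb[i] are the embeddings of the i-th (query, passage) pair;
# every other passage in the batch serves as a negative for query i.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor,
                              p_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    # Cosine similarity between every query and every passage in the batch.
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    scores = q_emb @ p_emb.T / temperature                  # (batch, batch) similarities
    # The matching passage for query i sits on the diagonal, at column i.
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)

# Toy usage with random tensors standing in for encoder outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, p).item())
```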

Model Performance

E5's performance is rigorously evaluated on the BEIR and MTEB benchmarks. Remarkably, without relying on any labeled data, E5 outperforms the strong BM25 baseline on the BEIR zero-shot retrieval benchmark. When fine-tuned with labeled data, its performance improves further, surpassing embedding models with significantly more parameters. The fine-tuning stage mixes in labeled datasets that inject human knowledge into the model, further sharpening its embeddings.
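For a sense of how such an evaluation could be reproduced, the snippet below runs a single MTEB task. It is a hedged sketch that assumes the mteb and sentence-transformers packages, and that the publicly released intfloat/e5-base checkpoint loads as a SentenceTransformer; it is not the authors' own evaluation harness.

```python
# Hedged sketch: evaluating an embedding model on one MTEB task.
# Assumes the `mteb` and `sentence-transformers` packages are installed and that
# the released "intfloat/e5-base" checkpoint loads as a SentenceTransformer.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base")
evaluation = MTEB(tasks=["Banking77Classification"])   # one of the 56 evaluation datasets
evaluation.run(model, output_folder="results/e5-base")
```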

Applications and Analysis

The core contribution, the E5 model, is versatile and efficient, serving tasks that demand single-vector text representations: zero-shot retrieval, few-shot and zero-shot text classification, semantic textual similarity, and text clustering. In summary, E5 sets a new standard for general-purpose text embeddings, suitable for a broad range of applications and delivering empirical gains despite having far fewer parameters than some larger models. However, whether state-of-the-art embeddings can be achieved from self-supervision alone remains an open question.
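As a usage illustration, the sketch below scores a query against two passages with a released E5 checkpoint. It assumes the publicly available intfloat/e5-base model on Hugging Face and its convention of prefixing inputs with "query: " and "passage: "; these are details of the released artifact, not of this summary.

```python
# Hedged usage sketch: zero-shot retrieval scoring with a released E5 checkpoint.
# Assumes `transformers` and `torch`; "intfloat/e5-base" and the "query: " /
# "passage: " input prefixes follow the public model card, not this summary.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModel.from_pretrained("intfloat/e5-base")

texts = [
    "query: what is contrastive pre-training?",
    "passage: Contrastive pre-training teaches a model to pull paired texts together.",
    "passage: BM25 is a classical lexical retrieval baseline.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Same attention-masked mean pooling as in the earlier sketch.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = F.normalize((hidden * mask).sum(dim=1) / mask.sum(dim=1), dim=-1)

# Cosine similarity of the query against each passage.
print((emb[:1] @ emb[1:].T).tolist())
```

Higher cosine scores indicate the passages the model considers more relevant to the query.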
