Text Embeddings by Weakly-Supervised Contrastive Pre-training

(2212.03533)
Published Dec 7, 2022 in cs.CL and cs.IR

Abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.

Overview

  • A new method for creating high-quality text embeddings utilizes contrastive pre-training with weak supervision.

  • The dataset CCPairs is introduced, comprising a large-scale collection of text pairs for contrastive learning.

  • The E5 model, trained on CCPairs, is the first to outperform the BM25 baseline on the BEIR zero-shot retrieval benchmark without using any labeled data.

  • When fine-tuned on a blend of labeled datasets that inject human knowledge, E5 achieves the best results on the MTEB benchmark, outperforming embedding models with far more parameters.

  • E5 proves versatile across retrieval, classification, and other single-vector embedding tasks, though how far purely self-supervised pre-training can take text embeddings remains an open question.

Introduction

The development of text embeddings has been instrumental in advancing natural language processing. These embeddings represent text as low-dimensional vectors, enabling efficient matching between texts and supporting retrieval, clustering, and classification. Although pre-trained language models such as BERT and GPT produce transferable text representations, their off-the-shelf outputs are suboptimal when used directly as single-vector embeddings. The paper introduces an approach for producing high-quality text embeddings through contrastive pre-training with weak supervision.
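To make the idea of a single-vector representation concrete, the sketch below mean-pools a pre-trained encoder's token outputs into one fixed-size vector per text. It is a minimal illustration only, assuming the Hugging Face transformers library and the generic bert-base-uncased checkpoint as a stand-in encoder rather than the paper's own model.

```python
# Minimal sketch: turning a pre-trained encoder's token outputs into a single
# fixed-size text embedding via attention-masked mean pooling.
# Assumes the `transformers` and `torch` packages; bert-base-uncased is only a
# stand-in encoder, not the paper's E5 model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["how do text embeddings work?", "contrastive pre-training for retrieval"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_states = model(**batch).last_hidden_state          # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1).float()          # zero out padding tokens
embeddings = (token_states * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)  # unit-length vectors
print(embeddings.shape)  # torch.Size([2, 768])
```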

Data Curation and Methodology

The cornerstone of this approach is the CCPairs dataset, a large-scale collection of text pairs extracted from semi-structured web sources and filtered for quality with a consistency-based approach. This dataset enables contrastive learning, in which the model learns to distinguish each relevant text pair from the many irrelevant pairings within a large batch of examples. Leveraging weak supervision from heterogeneous sources such as CommunityQA, Common Crawl, and scientific papers, the E5 model is trained contrastively with in-batch negatives.
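Concretely, this style of training can be sketched as a temperature-scaled contrastive (InfoNCE-style) loss in which every other passage in the batch serves as a negative for a given query. The snippet below is a minimal PyTorch sketch of that objective, assuming pre-computed query and passage embeddings and an illustrative temperature value rather than the paper's exact hyperparameters.

```python
# Sketch of contrastive training with in-batch negatives (InfoNCE-style loss).
# q_emb[i] and p_emb[i] are the embeddings of the i-th (query, passage) pair;
# every other passage in the batch serves as a negative for query i.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor,
                              p_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    # Cosine similarity between every query and every passage in the batch.
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    scores = q_emb @ p_emb.T / temperature                  # (batch, batch) similarities
    # The matching passage for query i sits on the diagonal, at column i.
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)

# Toy usage with random tensors standing in for encoder outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, p).item())
```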

Model Performance

E5's performance is rigorously evaluated on the BEIR and MTEB benchmarks. Remarkably, without relying on any labeled data, E5 outperforms the strong BM25 baseline on the BEIR zero-shot retrieval benchmark. When fine-tuned with labeled data, its performance improves further, surpassing embedding models with significantly more parameters. The fine-tuning stage mixes in labeled datasets that inject human knowledge into the model, further sharpening its embeddings.
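For a sense of how such an evaluation could be reproduced, the snippet below runs a single MTEB task. It is a hedged sketch that assumes the mteb and sentence-transformers packages, and that the publicly released intfloat/e5-base checkpoint loads as a SentenceTransformer; it is not the authors' own evaluation harness.

```python
# Hedged sketch: evaluating an embedding model on one MTEB task.
# Assumes the `mteb` and `sentence-transformers` packages are installed and that
# the released "intfloat/e5-base" checkpoint loads as a SentenceTransformer.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base")
evaluation = MTEB(tasks=["Banking77Classification"])   # one of the 56 evaluation datasets
evaluation.run(model, output_folder="results/e5-base")
```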

Applications and Analysis

The core contribution, the E5 model, is versatile and efficient, serving tasks that demand single-vector text representations: zero-shot retrieval, few-shot and zero-shot text classification, semantic textual similarity, and text clustering. In summary, E5 sets a new standard for general-purpose text embeddings, suitable for a broad range of applications and delivering empirical gains despite having far fewer parameters than some larger models. However, whether state-of-the-art embeddings can be achieved from self-supervision alone remains an open question.
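As a usage illustration, the sketch below scores a query against two passages with a released E5 checkpoint. It assumes the publicly available intfloat/e5-base model on Hugging Face and its convention of prefixing inputs with "query: " and "passage: "; these are details of the released artifact, not of this summary.

```python
# Hedged usage sketch: zero-shot retrieval scoring with a released E5 checkpoint.
# Assumes `transformers` and `torch`; "intfloat/e5-base" and the "query: " /
# "passage: " input prefixes follow the public model card, not this summary.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModel.from_pretrained("intfloat/e5-base")

texts = [
    "query: what is contrastive pre-training?",
    "passage: Contrastive pre-training teaches a model to pull paired texts together.",
    "passage: BM25 is a classical lexical retrieval baseline.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Same attention-masked mean pooling as in the earlier sketch.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = F.normalize((hidden * mask).sum(dim=1) / mask.sum(dim=1), dim=-1)

# Cosine similarity of the query against each passage.
print((emb[:1] @ emb[1:].T).tolist())
```

Higher cosine scores indicate the passages the model considers more relevant to the query.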
