Towards General Text Embeddings with Multi-stage Contrastive Learning

Published 7 Aug 2023 in cs.CL | (2308.03281v1)

Abstract: We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the number of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.


Authors (6)

Citations (212)


Summary

The paper introduces the GTE model, which employs multi-stage contrastive learning to create unified text embeddings that generalize across various NLP tasks.
The methodology involves unsupervised pre-training on 800M text pairs followed by supervised fine-tuning on 3M pairs, effectively training a 110M parameter model.
Empirical results show that the 110M-parameter GTE model outperforms larger embedding models and OpenAI’s ada-002 on the MTEB benchmark, and that it outperforms similarly sized code retrievers on code search.

Towards General Text Embeddings with Multi-stage Contrastive Learning

The paper under review presents GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. The authors argue for unifying various NLP tasks under a single text embedding model that can exploit large datasets drawn from diverse domains. The main outcome is a set of extensive empirical results in which GTE improves on existing embedding models, including considerably larger ones.

Model Description and Training Strategy

GTE is a unified framework for generating text embeddings. The base model has 110M parameters, which is modest next to many contemporary embedding models, some of them roughly 10x larger. Despite its size, GTE competes with, and on some benchmarks outperforms, these larger models as well as OpenAI's black-box embedding API. The backbone is a Transformer encoder, typically initialized from a pre-trained model such as BERT.
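As an illustration of how a Transformer encoder's per-token outputs can be reduced to a single text embedding, here is a minimal sketch of masked mean pooling followed by L2 normalization. The summary above does not say which pooling GTE uses, so the choice of mean pooling, the function names `mean_pool` and `l2_normalize`, and the toy vectors are all assumptions made for illustration.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[k] for vec in kept) / len(kept) for k in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Toy encoder output: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(mean_pool(tokens, mask))
```

Normalizing the pooled vector means a dot product between two embeddings equals their cosine similarity, which is convenient for the similarity-based training and retrieval discussed in this review.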

Training GTE involves two stages. The first, unsupervised pre-training, uses weakly supervised text pairs mined from publicly available sources such as CommonCrawl, scientific papers, Reddit, and GitHub, totaling approximately 800M pairs. The second, supervised fine-tuning, uses a collection of datasets largely drawn from prior work, totaling about 3M pairs. Both stages rely on contrastive learning over text pairs, which lets the model exploit the broad data mixture and generalize across NLP tasks ranging from semantic textual similarity to code search.
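To make the idea of contrastive learning over text pairs concrete, the following is a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common formulation for this kind of training. It is not taken from the paper: the function names `cosine` and `info_nce_loss`, the temperature of 0.05, and the toy vectors are illustrative assumptions, and the paper's actual objective may differ in its details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, docs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, document) embedding pairs.

    queries[i] and docs[i] form a positive pair; every other docs[j]
    in the batch acts as a negative for queries[i].
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch: each query matches its own document ...
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# ... versus a batch where the pairs have been swapped.
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned loss is close to zero, while the swapped batch gives a loss close to 20, which is 1/temperature. Minimizing the loss therefore pulls paired texts together and pushes the other texts in the batch apart.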

Key Empirical Findings

The authors report strong results across multiple benchmarks. On the Massive Text Embedding Benchmark (MTEB), which comprises 56 diverse datasets, GTE outperforms OpenAI’s commercial embedding model and several larger models on tasks including zero-shot text classification, text retrieval, and semantic textual similarity.

On MTEB, GTE-Base achieved an average score of 62.4, surpassing models including OpenAI's ada-002 and InstructOR-Base. GTE was also effective in code search: without tuning on each programming language individually, it outperformed previous state-of-the-art baselines such as CodeBERT and CodeRetriever.
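Retrieval and code search with an embedding model both come down to ranking candidates by their similarity to the query embedding, with code snippets embedded as ordinary text. The sketch below shows that ranking step under stated assumptions: cosine similarity as the scoring function, a hypothetical helper `rank_by_similarity`, and made-up two-dimensional vectors standing in for real model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, candidates):
    """Return candidate ids ordered from most to least similar to the query.

    candidates maps an id (e.g. a snippet name) to its embedding vector.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]

# Hypothetical embeddings of three code snippets and one text query.
snippets = {
    "sort_fn": [0.9, 0.1],
    "http_fn": [0.1, 0.9],
    "mixed": [0.5, 0.5],
}
ranking = rank_by_similarity([1.0, 0.0], snippets)
```

In a real system, the vectors would come from the embedding model and the search would typically run over an approximate nearest-neighbor index rather than a full scan.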

Implications and Future Research

The work has several implications for NLP research and practice. First, GTE's performance shows that pre-training on a broad range of data sources can yield embeddings that rival those produced by larger models focused on specific domains. This holds not only for natural-language tasks but also for code-related applications, where code is simply treated as text.

The multi-stage contrastive learning approach detailed in the paper opens avenues for developing more compact, efficient embedding models that do not compromise on performance. These findings could encourage the development of versatile, lightweight models for real-world applications that require robustness across diverse tasks.

For future work, it would be interesting to investigate how similar techniques could be applied to multilingual and multi-modal models, extending the reach of such general-purpose frameworks. Further refinement of data sampling and contrastive loss functions may also improve training efficiency and performance.

In conclusion, this study demonstrates the efficacy of multi-stage contrastive learning and provides a strong baseline for the research community to build upon in text embedding generation. GTE offers a scalable, efficient approach that may influence research on unified text and code representation learning.
