NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

(2405.17428)
Published May 27, 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract

Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model with a variety of architectural designs and training procedures to significantly enhance the performance of LLMs as versatile embedding models, while maintaining their simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For model training, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. Combining these techniques, our NV-Embed model, using only publicly available data, has achieved a record-high score of 69.32, ranking No. 1 on the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024), with 56 tasks encompassing retrieval, reranking, classification, clustering, and semantic textual similarity tasks. Notably, our model also attains the highest score of 59.36 on 15 retrieval tasks in the MTEB benchmark (also known as BEIR). We will open-source the model at: https://huggingface.co/nvidia/NV-Embed-v1.

Figure: Proposed architecture, showing a decoder-only LLM followed by a latent attention layer and MLP, with the query, key, and value attention paths.

Overview

  • The NV-Embed model introduces architectural advancements, such as a latent attention layer, which improves performance on text embedding tasks by better capturing information distributed across the input sequence.

  • The model employs a two-stage contrastive instruction-tuning approach for training, enhancing performance across retrieval and non-retrieval tasks without relying on proprietary datasets.

  • NV-Embed achieves top scores on benchmarks like MTEB and BEIR, surpassing prior leading models and demonstrating its effectiveness using publicly available data.

The NV-Embed Model: Advancements in Decoder-Only LLMs for Text Embedding Tasks

This paper presents the NV-Embed model, which pushes the boundaries of decoder-only LLMs as versatile text embedding models. Current trends in text embedding have shown that decoder-only LLM-based models have begun to outperform traditional bidirectional models such as BERT and T5 in general-purpose text embedding tasks, including dense vector-based retrieval. NV-Embed incorporates these advancements with novel architectural and training modifications to achieve better performance while maintaining simplicity and reproducibility.

Architectural Innovations

A prominent feature of NV-Embed is the introduction of a latent attention layer specifically designed to extract pooled embeddings from sequences of tokens. In traditional approaches, embeddings are obtained using either mean pooling or the embedding of the last <EOS> token. Both of these methods have their limitations: mean pooling might dilute crucial semantic information distributed across the token sequence, whereas the last token embedding could suffer from recency bias. The latent attention layer proposed in NV-Embed mitigates these issues by employing a form of cross-attention where the hidden states serve as queries and the keys and values come from a trainable latent array. This setup enables the model to better capture and represent the complex structure of the input sequence.
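To make the mechanism concrete, here is a minimal PyTorch sketch of latent-attention pooling, where the decoder hidden states act as queries against a trainable latent array that supplies the keys and values. This is an illustration under assumptions, not the paper's released code: the number of latents, the multi-head configuration, the residual MLP, and the final masked mean pooling are all assumed details.

```python
import torch
import torch.nn as nn


class LatentAttentionPooling(nn.Module):
    """Sketch of latent-attention pooling: token hidden states are queries,
    a trainable latent array provides keys/values, followed by an MLP and
    masked mean pooling over the sequence."""

    def __init__(self, hidden_dim: int, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        # Trainable latent array (sizes here are illustrative assumptions).
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim))
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
        batch_size = hidden_states.size(0)
        latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
        # Queries = decoder hidden states; keys/values = trainable latents.
        attn_out, _ = self.cross_attn(hidden_states, latents, latents)
        out = attn_out + self.mlp(attn_out)
        # Masked mean pooling over valid tokens yields the final embedding.
        mask = attention_mask.unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```

Compared with plain mean pooling, the latent array gives every token a shared, learned set of "slots" to attend over before pooling, which is how the design aims to preserve information that would otherwise be averaged away.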

Another significant architectural choice is the removal of the causal attention mask during contrastive training. While the causal mask in decoder-only LLMs is essential for preserving autoregressive properties in generation tasks, it limits the model's ability to fully utilize bidirectional context when learning embeddings. By simply eliminating the causal mask, NV-Embed harnesses the full potential of bidirectional attention, which enhances representation learning without the need for additional complex training phases, as seen in related works.
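Conceptually, the change is confined to the self-attention mask: during contrastive training, the upper-triangular (causal) mask is simply not applied, so every token can attend to every other token. The toy sketch of scaled dot-product attention scores below (not tied to any specific LLM implementation) makes the difference explicit.

```python
import torch


def attention_scores(q: torch.Tensor, k: torch.Tensor, causal: bool) -> torch.Tensor:
    """Toy scaled dot-product attention scores, with or without a causal mask.
    The only change needed for bidirectional embedding training is to skip the
    upper-triangular mask so tokens can attend to later positions as well."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (batch, seq, seq)
    if causal:
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))    # block attention to future tokens
    return scores.softmax(dim=-1)
```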

Training Enhancements

On the training side, NV-Embed employs a two-stage contrastive instruction-tuning approach. The first stage focuses on retrieval datasets, applying contrastive training with in-batch negatives and curated hard negatives. After this initial contrastive training phase, the model undergoes a second stage that blends various non-retrieval datasets into instruction tuning. This stage is designed not only to improve non-retrieval tasks such as classification, clustering, and semantic textual similarity (STS), but also to yield additional gains in retrieval performance.
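The first-stage objective can be sketched as a standard InfoNCE-style contrastive loss over the positive, the curated hard negatives, and the other in-batch examples; the flag below only hints at how a second stage might drop in-batch negatives when non-retrieval data is blended in. This is an illustrative sketch (the temperature, score layout, and flag name are assumptions), not the authors' training code.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(
    q_emb: torch.Tensor,          # (batch, dim) query embeddings
    pos_emb: torch.Tensor,        # (batch, dim) positive passage embeddings
    hard_neg_emb: torch.Tensor,   # (batch, num_neg, dim) curated hard negatives
    temperature: float = 0.02,    # assumed value
    use_in_batch_negatives: bool = True,
) -> torch.Tensor:
    """InfoNCE-style loss: each query is scored against its positive (index 0),
    its hard negatives, and optionally the other in-batch positives."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    hn = F.normalize(hard_neg_emb, dim=-1)

    pos_scores = (q * p).sum(-1, keepdim=True)                # (batch, 1)
    hard_scores = torch.einsum("bd,bnd->bn", q, hn)           # (batch, num_neg)
    logits = torch.cat([pos_scores, hard_scores], dim=1)

    if use_in_batch_negatives:
        in_batch = q @ p.t()                                  # (batch, batch)
        # Exclude each query's own positive from the in-batch negative scores.
        diag = torch.eye(q.size(0), dtype=torch.bool, device=q.device)
        logits = torch.cat([logits, in_batch.masked_fill(diag, float("-inf"))], dim=1)

    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits / temperature, labels)
```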

This two-stage methodology is distinct from previous models, which often do not separate the stages based on task difficulties or characteristics but rather apply a unified training strategy across all tasks. Through these meticulous design choices, NV-Embed delivers a model that achieves remarkable scores in diverse embedding benchmarks without relying on proprietary synthetic data, underscoring its reproducibility using entirely public datasets.

Empirical Results

The empirical evaluation of NV-Embed demonstrates its competitive edge. It achieves a top score of 69.32 on the Massive Text Embedding Benchmark (MTEB), which comprises 56 tasks spanning retrieval, reranking, classification, clustering, and semantic textual similarity, surpassing prior leading models such as E5-Mistral-7B-Instruct and SFR-Embedding. Notably, NV-Embed also sets a new record of 59.36 on the 15 retrieval tasks of MTEB (also known as BEIR), indicative of its strength in dense vector-based retrieval.

Implications and Future Directions

The contributions of NV-Embed have significant implications for the field of text embeddings. Architecturally, the use of a latent attention layer could inspire further research into more effective pooling techniques for complex sequence representations. Meanwhile, the removal of the causal mask invites discussions on simplifying bidirectional and decoder-based architectures to enhance their effectiveness across diverse tasks.

Practically, NV-Embed's ability to achieve state-of-the-art performance using publicly available data demystifies and democratizes high-performance text embeddings, making them accessible to a broader research community and enabling applications in domains that may not have access to proprietary datasets.

Future directions could include exploring more sophisticated training strategies that blend different tasks dynamically based on model feedback or extending latent attention methods to other types of models and tasks. There is also potential to further optimize the latent attention mechanism and examine its adaptability in transformer variants beyond decoder-only models.

In conclusion, NV-Embed marks a significant step forward in the application of decoder-only LLMs for text embedding tasks. Through innovative architectural designs and strategic training methods, it sets new performance benchmarks while emphasizing simplicity and reproducibility, thereby broadening the scope and accessibility of advanced text embedding research.
