NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2405.17428v3)
Abstract: Decoder-only LLM-based embedding models are beginning to outperform BERT- or T5-based embedding models on general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets that significantly enhance the performance of LLMs as versatile embedding models, while maintaining simplicity and reproducibility. For the model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For the training algorithm, we introduce a two-stage contrastive instruction-tuning method. Stage one applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. Stage two blends various non-retrieval datasets into the instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For the training data, we utilize hard-negative mining, synthetic data generation, and existing publicly available datasets to boost the performance of the embedding model. By combining these techniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No. 1 position on the MTEB leaderboard (as of May 24 and August 30, 2024, respectively) across 56 tasks, demonstrating the sustained effectiveness of the proposed methods over time. NV-Embed also achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB. We further provide an analysis of model compression techniques for generalist embedding models.
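To make the pooling architecture concrete, below is a minimal PyTorch sketch of latent attention pooling as the abstract describes it: last-layer token hidden states cross-attend to a trainable latent array, pass through an MLP, and are mean-pooled into a single embedding. The module name, latent count, MLP width, and residual connection are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttentionPooling(nn.Module):
    """Pool token hidden states via cross-attention to trainable latents
    (a sketch of the idea, not NV-Embed's released implementation)."""

    def __init__(self, d_model: int, num_latents: int = 512):
        super().__init__()
        # Trainable latent "dictionary"; acts as both keys and values.
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) last-layer decoder outputs (queries).
        # mask:   (batch, seq_len), 1 for real tokens, 0 for padding.
        scores = hidden @ self.latents.T / self.latents.shape[-1] ** 0.5
        attended = torch.softmax(scores, dim=-1) @ self.latents  # (B, L, d)
        attended = attended + self.mlp(attended)  # residual MLP (assumed)
        m = mask.unsqueeze(-1).to(attended.dtype)
        # Mean-pool over non-padding positions -> one embedding per input.
        return (attended * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
```

Likewise, the stage-one objective (contrastive training with in-batch negatives plus curated hard negatives) is typically implemented as an InfoNCE loss; the sketch below is a generic formulation, with the temperature value chosen for illustration rather than taken from the paper.

```python
def contrastive_loss(q: torch.Tensor, pos: torch.Tensor,
                     hard_negs: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over in-batch negatives and mined hard negatives.
    q: (B, d) instruction-prefixed query embeddings; pos: (B, d) positives;
    hard_negs: (B, K, d) hard negatives per query."""
    q, pos = F.normalize(q, dim=-1), F.normalize(pos, dim=-1)
    hard_negs = F.normalize(hard_negs, dim=-1)
    in_batch = q @ pos.T                             # (B, B); diagonal = own positive
    hard = torch.einsum("bd,bkd->bk", q, hard_negs)  # (B, K) hard-negative scores
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    target = torch.arange(q.size(0), device=q.device)  # each query matches its own positive
    return F.cross_entropy(logits, target)
```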
- Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
- SemEval-2012 task 6: A pilot on semantic textual similarity. In Agirre, E., Bos, J., Diab, M., Manandhar, S., Marton, Y., and Yuret, D. (eds.), *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 385–393, Montréal, Canada, 7-8 June 2012. Association for Computational Linguistics. URL https://aclanthology.org/S12-1051.
- Task-aware retrieval with instructions. arXiv preprint arXiv:2211.09260, 2022.
- MS MARCO: A human-generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
- LLM2Vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020, March 2020. URL https://arxiv.org/abs/2003.04807. Data available at https://github.com/PolyAI-LDN/task-specific-datasets.
- SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Bethard, S., Carpuat, M., Apidianaki, M., Mohammad, S. M., Cer, D., and Jurgens, D. (eds.), Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://aclanthology.org/S17-2001.
- BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023.
- SemEval-2022 task 8: Multilingual news article similarity. In Emerson, G., Schluter, N., Stanovsky, G., Kumar, R., Palmer, A., Schneider, N., Singh, S., and Ratan, S. (eds.), Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pp. 1094–1106, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.semeval-1.155. URL https://aclanthology.org/2022.semeval-1.155.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Stanford NLP Group. The Stanford Natural Language Inference (SNLI) corpus, 2022.
- Retrieval augmented language model pre-training. In International conference on machine learning, pp. 3929–3938. PMLR, 2020.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021.
- Perceiver IO: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
- Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- Lang, K. NewsWeeder: Learning to filter netnews. In Machine learning proceedings 1995, pp. 331–339. Elsevier, 1995.
- Gecko: Versatile text embeddings distilled from large language models. arXiv preprint arXiv:2403.20327, 2024a.
- Open source strikes bread - new fluffy embeddings model, 2024b. URL https://www.mixedbread.ai/blog/mxbai-embed-large-v1.
- Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- PAQ: 65 million probably-asked questions and what you can do with them. Transactions of the Association for Computational Linguistics, 9:1098–1115, 2021.
- Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 175–184, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-demo.21.
- MTOP: A comprehensive multilingual task-oriented semantic parsing benchmark. In Merlo, P., Tiedemann, J., and Tsarfaty, R. (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2950–2962, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.257. URL https://aclanthology.org/2021.eacl-main.257.
- AnglE-optimized text embeddings. arXiv preprint arXiv:2309.12871, 2023. URL https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1.
- Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281, 2023.
- ChatQA: Surpassing GPT-4 on conversational QA and RAG. arXiv preprint arXiv:2401.10225, 2024.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
- Tweet sentiment extraction, 2020. URL https://kaggle.com/competitions/tweet-sentiment-extraction.
- WWW’18 open challenge: Financial opinion mining and question answering. In Companion Proceedings of the The Web Conference 2018, pp. 1941–1942, 2018.
- Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, pp. 165–172, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450324090. doi: 10.1145/2507157.2507163. URL https://doi.org/10.1145/2507157.2507163.
- SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3, 2024.
- Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 2013.
- MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022.
- Generative representational instruction tuning. arXiv preprint arXiv:2402.09906, 2024.
- Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005, 2022.
- Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899, 2021.
- I wish I would have loved this one, but I didn’t – a multilingual dataset for counterfactual detection in product reviews. arXiv preprint arXiv:2104.06893, 2021.
- OpenAI. GPT-4, 2023.
- OpenAI. New embedding models and api updates, 2024.
- Training language models to follow instructions with human feedback. Advances in neural information processing systems, 2022.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
- Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
- The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
- CARER: Contextualized affect representations for emotion recognition. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3687–3697, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1404. URL https://aclanthology.org/D18-1404.
- REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023.
- Stack Exchange Community. Stack Exchange data dump, 2023.
- BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663, 2021.
- FEVER: A large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.
- An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:1–28, 2015.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Voyage-AI. voyage-large-2-instruct: Instruction-tuned and rank 1 on MTEB, 2024.
- Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 241–251, 2018.
- SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.
- InstructRetro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023a.
- Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
- Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368, 2023b.
- Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International conference on machine learning, pp. 5180–5189. PMLR, 2018.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.