Condenser: a Pre-training Architecture for Dense Retrieval

Published 16 Apr 2021 in cs.CL and cs.IR | (2104.08253v2)

Abstract: Pre-trained Transformer language models (LM) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient text comparison and retrieval. However, dense encoders require a lot of data and sophisticated techniques to effectively train and suffer in low data situations. This paper finds a key reason is that standard LMs' internal attention structure is not ready-to-use for dense encoders, which need to aggregate text information into the dense representation. We propose to pre-train towards dense encoder with a novel Transformer architecture, Condenser, where LM prediction CONditions on DENSE Representation. Our experiments show Condenser improves over standard LM by large margins on various text retrieval and similarity tasks.

Citations (235)

Summary

The paper introduces Condenser, a pre-training architecture that gives language models the structural readiness needed for dense retrieval.
It modifies the standard Transformer encoder by adding a Condenser head that makes Masked Language Model predictions from both early-layer and late-layer representations.
Experimental results show that Condenser outperforms standard pre-trained LMs, especially in low-data scenarios, on tasks such as open-domain question answering and sentence similarity.

An Overview of "Condenser: a Pre-training Architecture for Dense Retrieval"

The paper "Condenser: a Pre-training Architecture for Dense Retrieval" by Luyu Gao and Jamie Callan from Carnegie Mellon University introduces a novel Transformer-based architecture designed specifically for dense information retrieval tasks. The proposed architecture, Condenser, addresses the inefficiencies associated with using pre-trained LMs like BERT for encoding text into dense vector representations. The research identifies that the internal attention mechanisms of standard Transformer models are not optimally structured for aggregating text information into dense representations required for dense retrieval.

Background and Motivation

The current standard practice in dense retrieval is to fine-tune a deep bidirectional Transformer encoder so that it maps each text sequence to a single vector representation. This method has proven effective in many downstream tasks, but it faces significant challenges. Models used as bi-encoders require large amounts of data and sophisticated training methods to become effective encoders. Their performance also degrades in low-data scenarios because they lack structural readiness: their internal attention patterns are not preconditioned to aggregate information from the whole sequence into one dense representation.
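As a small illustration of why a single vector per text makes comparison efficient, the sketch below ranks passages by the dot product between a query vector and pre-computed passage vectors. The function names and the three-dimensional vectors are made up for illustration; in a real system the vectors would come from the fine-tuned encoder, and the similarity function is an assumption here (dot product is a common choice, not something this summary specifies).

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    query_vec: List[float], passage_vecs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every passage against the query and sort by descending score."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical, hand-written vectors standing in for encoder outputs.
query = [1.0, 0.0, 1.0]
passages = {
    "p1": [0.5, 0.5, 0.0],
    "p2": [1.0, 0.0, 1.0],
    "p3": [0.0, 1.0, 0.0],
}
ranking = rank_passages(query, passages)
print(ranking)
```

Because passage vectors do not depend on the query, they can be computed once and indexed ahead of time; only the query has to be encoded at search time.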

The Condenser Architecture

Condenser is proposed as a pre-training architecture that embeds structural readiness into bi-encoders. It modifies the typical Transformer encoder by introducing a Condenser head, so that language model pre-training conditions on a dense representation. The backbone is split into early and late layers that are applied in sequence, and the Condenser head uses both early and late representations to make Masked Language Model (MLM) predictions.

Key architectural elements include:

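Early and Late Backbone Layers: The backbone encoder is divided into early and late layers that run in sequence, so that each group's output can be routed separately to the Condenser head.
Condenser Head: During MLM pre-training, this component receives the late-layer [CLS] vector together with the early-layer token representations, so its predictions must condition on the dense [CLS] representation. This pushes the late layers to aggregate global sequence information into that vector.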

In fine-tuning scenarios, the Condenser head is discarded, allowing the pre-trained backbone, which now has the learned structural readiness, to effectively function as a dense retriever.
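The data flow described above can be sketched as follows. This is a minimal routing sketch, not the authors' implementation: it assumes, following the paper's description, that the head's input is the late-layer [CLS] vector concatenated with the early-layer token vectors. The layers are toy stand-ins (`add_constant` is a hypothetical helper, not a Transformer block), chosen only so that the routing is easy to see in the numbers.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Sequence = List[Vector]  # position 0 is the [CLS] slot
Layer = Callable[[Sequence], Sequence]


def run_layers(layers: List[Layer], hidden: Sequence) -> Sequence:
    """Apply a stack of layers in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def condenser_pretrain_forward(
    embeddings: Sequence,
    early_layers: List[Layer],
    late_layers: List[Layer],
    head_layers: List[Layer],
) -> Tuple[Sequence, Sequence]:
    """Return (head_input, head_output) for the pre-training pass."""
    h_early = run_layers(early_layers, embeddings)
    h_late = run_layers(late_layers, h_early)
    # Short circuit: late [CLS] plus early token vectors (assumed routing).
    head_input = [h_late[0]] + h_early[1:]
    head_output = run_layers(head_layers, head_input)  # would feed the MLM loss
    return head_input, head_output


def encode_for_retrieval(
    embeddings: Sequence, early_layers: List[Layer], late_layers: List[Layer]
) -> Vector:
    """Fine-tuning and inference path: the head is dropped, late [CLS] is the output."""
    return run_layers(late_layers, run_layers(early_layers, embeddings))[0]


def add_constant(c: float) -> Layer:
    """Toy stand-in for a Transformer layer: adds c to every value."""
    return lambda hidden: [[x + c for x in vec] for vec in hidden]


tokens = [[0.0], [1.0], [2.0]]  # [CLS], token 1, token 2
early, late, head = [add_constant(1.0)], [add_constant(10.0)], [add_constant(0.0)]
head_input, head_output = condenser_pretrain_forward(tokens, early, late, head)
dense_vector = encode_for_retrieval(tokens, early, late)
print(head_input, dense_vector)
```

In this toy run the token positions reaching the head carry only the early-layer offset, while the [CLS] position carries the late-layer offset as well. Whatever the late layers compute can therefore reach the head only through [CLS], which is the pressure that encourages information to be aggregated there.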

Experimental Evaluation

The experimental results demonstrate that Condenser pre-training substantially enhances the performance of dense retrieval tasks across various benchmarks, especially in low-data setups. Notably, the architecture showed improvements in tasks involving sentence similarity and open-domain question answering (QA), often outperforming traditional pre-trained LMs and task-specific pre-trained models like the Inverse Cloze Task (ICT).

In high-data tasks, Condenser's performance was found to align with or surpass complicated fine-tuning approaches like those using hard negatives and advanced distillation techniques. This highlights Condenser's potential to simplify training pipelines while providing robust performance benefits.

Theoretical and Practical Implications

Theoretically, the research introduces a compelling approach to embedding structural readiness in language models, paving the way for more efficient deployment of bi-encoders across various retrieval tasks. Practically, Condenser presents a cost-effective alternative to extensive data-specific pre-training or retriever-specific adjustments, offering improved performance with a simpler training pipeline.

Future Directions

The research invites future exploration into leveraging Condenser for other pre-training objectives and incorporation into broader NLP tasks requiring dense representations. With advancements likely in both the architecture and its integration with fine-tuning techniques, Condenser promises to play a significant role in the ongoing development of efficient, scalable dense retrieval systems. The team notes that further optimization of the architecture and hyperparameter tuning could yield even more significant gains, suggesting a promising avenue for future study.

In summary, "Condenser: a Pre-training Architecture for Dense Retrieval" presents a pre-training design tailored to dense retrieval: it addresses the limitations of standard pre-trained LMs as dense encoders by establishing an effective internal attention structure before fine-tuning.
