SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

Published 6 Jul 2022 in cs.IR | (2207.02578v2)

Abstract: In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA, to improve the sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus, and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets, and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly more storage cost. Our code and model checkpoints are available at https://github.com/microsoft/unilm/tree/master/simlm .

Summary

The paper introduces SimLM, which employs a representation bottleneck to efficiently compress passage information for dense retrieval.
It replaces masked language modeling with a replaced language modeling objective inspired by ELECTRA, which improves sample efficiency and reduces the input-distribution mismatch between pre-training and fine-tuning.
Experiments on MS-MARCO and Natural Questions demonstrate improved accuracy and storage efficiency relative to strong baselines.

Overview of SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

The paper introduces SimLM, a novel pre-training methodology targeted at enhancing dense passage retrieval. Dense retrieval has become a key component in information retrieval systems due to its ability to map queries and passages into a low-dimensional vector space, facilitating semantic comparison. SimLM proposes a straightforward yet efficient pre-training technique that leverages a representation bottleneck architecture.
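
To make the vector-space comparison concrete, the following is a minimal sketch of dense retrieval scoring, assuming passages and queries have already been encoded into fixed-size vectors. The function names, the 3-dimensional toy vectors, and the use of a plain dot product are illustrative assumptions, not code from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(
    query_vec: Sequence[float],
    passage_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and return the k best."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values; a real system would use encoder outputs).
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

top = retrieve(query, passages, k=2)
print(top)
```

In practice the exhaustive loop is usually replaced by an approximate nearest-neighbor index, but the scoring rule stays the same: one vector per passage, one similarity computation per query-passage pair.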

Key Contributions

Representation Bottleneck Architecture: SimLM pairs a deep encoder with a shallow decoder that are connected through a representation bottleneck, specifically the [CLS] vector. Because the decoder depends on this single vector, pre-training pushes the encoder to compress the essential passage information into it, and the same vector then serves as the passage representation during fine-tuning for retrieval.

Replaced Language Modeling Objective: SimLM employs a replaced language modeling objective inspired by ELECTRA, which improves sample efficiency. It also reduces the mismatch in input distribution between pre-training and fine-tuning, a common challenge in dense retrieval.
Self-Supervised Pre-Training: The method does not rely on labeled data or queries, widening its applicability across various scenarios where labeled data is unavailable.
Performance Metrics: The paper reports substantial improvements over existing strong baselines like BM25 and multi-vector approaches such as ColBERTv2, across datasets including MS-MARCO and Natural Questions (NQ).
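
The input corruption behind a replaced language modeling objective can be sketched as follows. This is a toy illustration under stated assumptions: the `corrupt` helper is hypothetical, a uniform random sampler stands in for the small generator network that an ELECTRA-style setup would use, and the two replacement rates are illustrative values rather than settings confirmed by this summary. The sketch also assumes the training target is the original, uncorrupted sequence.

```python
import random
from typing import List, Sequence, Tuple


def corrupt(
    tokens: Sequence[str],
    vocab: Sequence[str],
    replace_rate: float,
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Replace a random subset of tokens with sampled alternatives.

    Returns the corrupted sequence and the sorted positions that were chosen.
    A sampled token may coincide with the original one.
    """
    n_replace = max(1, round(len(tokens) * replace_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_replace))
    corrupted = list(tokens)
    for pos in positions:
        # Placeholder for a generator: a real one would be a small masked LM.
        corrupted[pos] = rng.choice(vocab)
    return corrupted, positions


rng = random.Random(0)
tokens = "dense retrieval maps queries and passages into vectors".split()
vocab = sorted(set(tokens)) + ["sparse", "documents", "tokens"]

# Two independently corrupted views: one for the encoder, one for the decoder.
enc_input, enc_pos = corrupt(tokens, vocab, 0.3, rng)
dec_input, dec_pos = corrupt(tokens, vocab, 0.5, rng)

# Assumed training target for both views: the original token sequence.
targets = list(tokens)

print(enc_input, enc_pos)
print(dec_input, dec_pos)
```

Because the corrupted inputs contain ordinary vocabulary tokens instead of a special mask symbol, they look more like the text seen at fine-tuning time, which is the distribution mismatch the objective is meant to reduce.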

Experimental Results

The experiments show clear gains. On the MS-MARCO passage ranking dataset, SimLM achieves an MRR@10 of 41.1, outperforming multi-vector models such as ColBERTv2 that incur significantly higher storage costs. This indicates that a single dense vector per passage can retain enough semantic information to compete with costlier representations. On the NQ dataset, SimLM achieves R@20 of 85.2 and R@100 of 89.7.
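
For readers unfamiliar with these metrics, here is a minimal sketch of how MRR@k and a top-k hit rate are commonly computed. The function names and toy rankings are hypothetical, and the sketch assumes R@k is reported as the fraction of queries with at least one relevant passage in the top k, which is the usual convention for open-domain QA retrieval. Both functions return values in [0, 1]; figures such as 41.1 correspond to these values multiplied by 100.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def hit_rate_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant passage in the top k."""
    hits = sum(
        1
        for qid, ranked in rankings.items()
        if any(pid in relevant.get(qid, set()) for pid in ranked[:k])
    )
    return hits / len(rankings)


# Toy example with three queries (hypothetical passage ids).
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h"],
}
relevant = {"q1": {"b"}, "q2": {"d"}, "q3": {"z"}}

mrr = mrr_at_k(rankings, relevant, k=10)
hit1 = hit_rate_at_k(rankings, relevant, k=1)
hit2 = hit_rate_at_k(rankings, relevant, k=2)
print(mrr, hit1, hit2)
```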

Comparison with Existing Methods

SimLM's approach contrasts with other pre-training methods like Condenser and coCondenser by omitting skip connections between encoder and decoder layers. This removes any path around the bottleneck, compelling the [CLS] vector to encode all vital information. SimLM's replaced language modeling objective is also more sample-efficient than typical masked language modeling.

Implications and Future Directions

The introduction of SimLM expands the potential for developing efficient dense retrieval systems, especially where query-labeled data is sparse. Its architecture can be seamlessly integrated into current retrieval pipelines without extensive modifications, suggesting broad applicability. The compact representation from the bottleneck leads to lower computational and storage costs, providing a practical edge in real-world applications.
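
The storage argument comes down to simple arithmetic, sketched below. The MS-MARCO passage collection contains 8,841,823 passages; every other number is an assumption for illustration: 768-dimensional float32 vectors for the single-vector index, and a hypothetical uncompressed token-level index with 64 vectors of 128 dimensions per passage. Real multi-vector systems, ColBERTv2 included, apply compression, so the second figure is not a measurement of any actual system.

```python
def index_size_bytes(
    num_passages: int,
    vectors_per_passage: int,
    dim: int,
    bytes_per_value: int,
) -> int:
    """Raw size of a flat, uncompressed vector index."""
    return num_passages * vectors_per_passage * dim * bytes_per_value


NUM_PASSAGES = 8_841_823  # MS-MARCO passage collection

# Single-vector index: one vector per passage (dimension is an assumption).
single = index_size_bytes(NUM_PASSAGES, vectors_per_passage=1, dim=768, bytes_per_value=4)

# Hypothetical multi-vector index: one vector per token, no compression.
multi = index_size_bytes(NUM_PASSAGES, vectors_per_passage=64, dim=128, bytes_per_value=4)

ratio = multi / single
print(f"single-vector: {single / 1e9:.1f} GB")
print(f"multi-vector (hypothetical): {multi / 1e9:.1f} GB")
print(f"ratio: {ratio:.1f}x")
```

The ratio depends only on the number of vectors per passage and their width, which is why a one-vector-per-passage bottleneck keeps the index small regardless of passage length.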

Potential future work could explore scaling the model size and corpus to further leverage the capabilities of SimLM. Additionally, evaluating multilingual retrieval and zero-shot capabilities could open up new research avenues, given the method's inherent flexibility.

Conclusion

SimLM represents an advance in pre-training techniques for dense passage retrieval. It delivers improved retrieval quality and storage efficiency, offering substantial value to information retrieval systems. While certain limitations remain, including reliance on fine-tuning for optimal performance, SimLM sets a foundation for effective retrieval models across diverse environments.
