Abstract

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue is due to limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs under the framework of Data Augmentation (DA). Our study shows that common DA practices, such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations, and it even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).

Overview

  • The paper introduces an innovative data augmentation (DA) strategy within the contrastive learning framework for dense retrievers (DRs), aiming to improve their generalizability without increasing model size.

  • It demonstrates that employing diverse queries and leveraging various sources of supervision can lead to state-of-the-art effectiveness in both supervised and zero-shot evaluations.

  • Empirical insights reveal that cheap, large-scale augmented queries and multiple relevance signals can significantly enhance a retriever's performance.

  • The research introduces DRAGON, a BERT-base-sized dense retriever, which showcases remarkable retrieval effectiveness, suggesting its potential as a robust model for domain adaptation in retrieval systems.

Diverse Augmentation Strategies for Training Generalizable Dense Retrievers

Introduction

In the realm of information retrieval, dense retrievers (DRs) have gained prominence for their ability to efficiently sift through large datasets to find relevant information. Existing DR training methodologies, including unsupervised contrastive learning and pseudo-query generation, have shown promise but often at the expense of either supervised or zero-shot retrieval effectiveness. The common belief links this trade-off to limited model capacity. Challenging this notion, new research demonstrates that a generalizable dense retriever can be trained to achieve high accuracy across both tasks without necessarily increasing the model size. The key lies in a systematic examination of data augmentation (DA) practices within the contrastive learning framework for DRs.
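
To make the training setup concrete, the following is a minimal sketch of the dual-encoder contrastive objective (InfoNCE over in-batch negatives) that underlies this line of DR work. The toy Encoder, the tensor shapes, and the temperature are illustrative assumptions standing in for a BERT-base encoder and a real training configuration, not the paper's exact recipe.

```python
# Minimal sketch of in-batch-negative contrastive training for a dual-encoder
# dense retriever. The tiny Encoder stands in for a BERT-base encoder; shapes,
# vocabulary size, and the temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Placeholder for a transformer encoder that maps a text to one vector."""
    def __init__(self, vocab_size=30522, dim=768):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        return self.emb(token_ids)                   # -> (batch, dim)

def contrastive_loss(q_vecs, p_vecs, temperature=0.05):
    """InfoNCE over in-batch negatives: each query's positive passage sits on
    the diagonal of the query-passage similarity matrix."""
    sims = q_vecs @ p_vecs.T / temperature           # (batch, batch)
    targets = torch.arange(sims.size(0))             # positives on the diagonal
    return F.cross_entropy(sims, targets)

# One toy training step on random token ids.
encoder = Encoder()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

queries = torch.randint(0, 30522, (8, 16))    # 8 queries, 16 tokens each
passages = torch.randint(0, 30522, (8, 128))  # their positive passages

loss = contrastive_loss(encoder(queries), encoder(passages))
loss.backward()
optimizer.step()
```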

Data Augmentation for Contrastive Learning

The study identifies common DA practices, such as query augmentation with generative models and relevance label creation using cross-encoders, as often inefficient and sub-optimal. It introduces a novel DA approach built on diverse queries and multiple sources of supervision. This method enables the progressive training of a generalizable DR that achieves state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models that rely on more complex late-interaction mechanisms (ColBERTv2 and SPLADE++).

Empirical Insights

Through detailed empirical exploration, the research uncovers pivotal insights for DR training. In particular:

  • Relevance Label Augmentation: The challenge in training generalizable DRs lies in creating diverse relevance labels for each query. By employing multiple retrievers, as opposed to solely relying on a strong cross-encoder, the study illustrates the effectiveness of leveraging a range of relevance signals.
  • Query Augmentation: The findings advocate for using cheap and large-scale augmented queries (e.g., cropped sentences) rather than expensive neural generative queries. This approach not only reduces costs but also enhances the retriever's capability to generalize across different domains. Both ideas are sketched in the example after this list.
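
Below is a hypothetical sketch of the two ideas above: pseudo-queries produced by cropping short spans from corpus documents, and relevance labels gathered as the top-k results of several existing retrievers. The function names, the retrievers dict of callables, and all parameter values are illustrative assumptions rather than the paper's exact pipeline.

```python
# Hypothetical sketch: pseudo-queries are random word-span crops from corpus
# documents, and each pseudo-query is then labeled with the top-k passages
# returned by several existing retrievers (e.g., a sparse and a dense teacher).
# `retrievers` is assumed to map a name to a callable(query, k) -> passage ids.
import random

def crop_queries(documents, max_words=12, num_per_doc=2, seed=0):
    """Create cheap pseudo-queries by cropping short word spans from documents."""
    rng = random.Random(seed)
    queries = []
    for doc_id, text in documents.items():
        words = text.split()
        for _ in range(num_per_doc):
            if len(words) <= max_words:
                span = words
            else:
                start = rng.randrange(len(words) - max_words)
                span = words[start:start + max_words]
            queries.append((doc_id, " ".join(span)))
    return queries

def label_with_teachers(queries, retrievers, k=10):
    """Collect top-k passage ids from each teacher retriever as relevance labels."""
    labels = {}
    for doc_id, query in queries:
        labels[(doc_id, query)] = {
            name: retrieve(query, k) for name, retrieve in retrievers.items()
        }
    return labels

# Usage with stand-in retrievers (replace with e.g. BM25 and a trained DR).
docs = {"d1": "Dense retrievers map queries and passages into one vector space."}
fake_retrievers = {"sparse": lambda q, k: ["d1"], "dense": lambda q, k: ["d1"]}
pseudo_queries = crop_queries(docs)
teacher_labels = label_with_teachers(pseudo_queries, fake_retrievers)
```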

Moreover, the study finds that directly learning from diverse relevance labels sourced from multiple retrievers at once is suboptimal. It therefore proposes progressively augmenting the relevance labels during training, which makes the harder supervision signals easier to learn; a sketch of this schedule follows.
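
The sketch below illustrates that progressive schedule under stated assumptions: a hypothetical train_one_stage helper runs contrastive updates (for example, with the loss sketched earlier), and each stage adds the labels from one more supervision source, assumed to be ordered from easier to harder teachers.

```python
# Hedged sketch of progressive label augmentation. Rather than training on all
# teachers' labels at once, training runs in stages and each stage adds the
# labels from one more supervision source. train_one_stage is a hypothetical
# callable that would run contrastive updates over the currently active labels.

def progressive_training(retriever, pseudo_queries, labels_by_source,
                         train_one_stage, epochs_per_stage=1):
    """labels_by_source: list of (source_name, labels) pairs, assumed to be
    ordered from easier to harder supervision signals."""
    active_labels = []
    for source_name, labels in labels_by_source:
        active_labels.append((source_name, labels))   # add one source per stage
        for _ in range(epochs_per_stage):
            train_one_stage(retriever, pseudo_queries, active_labels)
    return retriever
```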

Contributions and Practical Implications

The paper makes several notable contributions. It presents a systematic evaluation of DR training under the lens of data augmentation, shedding light on how to improve training methods for dense retrievers. The introduction of a progressive label augmentation strategy is particularly noteworthy for guiding the learning of complex relevance signals. Practically, the research showcases DRAGON, a BERT-base-sized dense retriever, which excels in retrieval effectiveness without increased model complexity. This advancement suggests the viability of employing DRAGON as a robust foundation model for domain adaptation tasks in retrieval systems.

Speculations on Future Developments

Looking ahead, the findings prompt a reevaluation of the role of data augmentation in the training of dense retrievers. The remarkable performance of DRAGON, armed with a diverse augmentation strategy, hints at the untapped potential of existing model architectures when coupled with innovative training regimes. Future research may explore the integration of generative and contrastive pre-training or delve into domain-specific pre-training to address identified weaknesses in zero-shot retrieval tasks. Such explorations could further diminish the gap between supervised and zero-shot effectiveness, paving the way for more versatile and efficient retrieval systems.

In sum, this research demonstrates the power of strategic data augmentation in enhancing the generalizability of dense retrievers. By rethinking conventional training paradigms, DRAGON shows how much capability existing architectures still hold, pointing the way toward stronger and more practical information retrieval systems.
