Emergent Mind

Dense X Retrieval: What Retrieval Granularity Should We Use?

(2312.06648)
Published Dec 11, 2023 in cs.CL, cs.AI, and cs.IR

Abstract

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularities. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval. Moreover, retrieval by proposition also enhances the performance of downstream QA tasks, since the retrieved texts are more condensed with question-relevant information, reducing the need for lengthy input tokens and minimizing the inclusion of extraneous, irrelevant information.

Segmenting/indexing a retrieval corpus by proposition enhances dense retrievers' performance in QA tasks.

Overview

  • Dense retrieval systems are critical in NLP for sourcing information efficiently.

  • The paper introduces propositions as a novel unit for information retrieval, aiming to improve the retrieval process.

  • An empirical study compares the effectiveness of different retrieval granularities, with propositions proving the most effective.

  • Proposition-based retrieval enhances performance in downstream QA tasks by providing denser relevant information.

  • The paper's major contributions include the 'FACTOID WIKI' corpus and a demonstration of the practicality of propositions for dense retrievers.

Introduction

Dense retrieval systems are integral to open-domain NLP applications: they source relevant information by sifting through large corpora. One crucial yet often overlooked design choice is the granularity of the retrieval unit, i.e. whether documents, passages, or sentences are indexed and retrieved. This paper examines how that choice of granularity affects both retrieval quality and downstream task performance.

Propositions as Retrieval Units

While passages and sentences are routinely used as retrieval units, this paper proposes a different approach: using "propositions" as retrieval units. Propositions are defined as atomic expressions within text, each conveying a distinct factoid in a concise, self-contained natural language format. In contrast to indexing longer passages or complex sentences, proposition indexing presents each fact as a standalone unit, which can improve retrieval quality.
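To make the distinction concrete, here is a hand-written sketch of the three granularities for a single invented passage (not taken from FACTOID WIKI; in the paper, propositions are produced by a trained segmentation model, not by hand):

```python
# Hand-written illustration of retrieval-unit granularity. The passage
# is invented for this sketch; the paper generates propositions with a
# trained model rather than manually.
passage = (
    "Marie Curie won the Nobel Prize in Physics in 1903. "
    "She later won a second Nobel Prize, in Chemistry, in 1911."
)

# Sentence-level units: the second sentence is not self-contained,
# since "She" cannot be resolved without the first sentence.
sentences = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "She later won a second Nobel Prize, in Chemistry, in 1911.",
]

# Proposition-level units: atomic factoids with references resolved,
# so each unit can be retrieved and understood out of context.
propositions = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
]

# Every proposition names its subject explicitly.
assert not any(p.startswith("She") for p in propositions)
```

The key property is decontextualization: each proposition carries the context (here, the resolved subject "Marie Curie") needed to interpret it in isolation.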

Empirical Evaluation of Retrieval Granularity

An empirical comparison is drawn among retrieval granularities using a processed version of the English Wikipedia corpus, termed 'FACTOID WIKI.' The corpus is indexed at three levels of granularity: 100-word passages, sentences, and propositions. Six dual-encoder retrievers are evaluated on five open-domain QA datasets. A key finding is that proposition-based retrieval substantially outperforms traditional passage- or sentence-based methods in dense retrieval tasks.
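The indexing-and-retrieval flow can be sketched as follows. This is a minimal illustration only: a unit-normalized bag-of-words encoder stands in for the learned dual encoders the paper evaluates, and the two toy indexes are invented examples, not FACTOID WIKI content:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy encoder: a unit-normalized bag-of-words vector. The paper uses
    # learned dual encoders; this stand-in only illustrates the flow of
    # indexing a corpus at some granularity and ranking units by score.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def score(query_vec, unit_vec):
    # Dot product of unit-normalized vectors, i.e. cosine similarity.
    return sum(v * unit_vec.get(w, 0.0) for w, v in query_vec.items())

def retrieve(query, index, k=1):
    q = embed(query)
    return sorted(index, key=lambda u: score(q, embed(u)), reverse=True)[:k]

# The same (invented) content indexed at two granularities.
passage_index = [
    "The Eiffel Tower is 330 metres tall. It was completed in 1889.",
]
proposition_index = [
    "The Eiffel Tower is 330 metres tall.",
    "The Eiffel Tower was completed in 1889.",
]

best = retrieve("How tall is the Eiffel Tower?", proposition_index)[0]
```

With a proposition-level index, the top-ranked unit contains only the fact the query asks about, whereas the passage-level index can only return the whole mixed passage.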

Downstream Task Performance and Contributions

Propositional retrieval not only improves retrieval metrics but also enhances downstream QA performance. Because propositions are more condensed, they deliver a higher density of question-relevant information, requiring fewer input tokens and admitting less irrelevant content. The paper's main contributions are the proposition as a novel retrieval unit for dense retrieval and the release of 'FACTOID WIKI.' Under the same input-token limit, proposition retrieval generalizes across retrievers and yields higher downstream question-answering accuracy, demonstrating its practicality for efficient information access with dense retrievers.
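The token-budget argument can be sketched as a greedy context packer. This is an assumption-laden illustration: word count stands in for tokens, and the ranked unit lists are invented examples, not the paper's data:

```python
def pack_context(ranked_units, word_budget):
    # Greedily add retrieved units, best-ranked first, until the word
    # budget is exhausted (word count is a rough proxy for tokens;
    # QA readers cap their input length similarly).
    picked, used = [], 0
    for unit in ranked_units:
        n = len(unit.split())
        if used + n > word_budget:
            break
        picked.append(unit)
        used += n
    return picked

# Invented example: under the same budget, short self-contained
# propositions fit more distinct facts than 100-word-scale passages.
ranked_propositions = [
    "Mount Everest is 8,849 metres tall.",
    "Mount Everest lies on the border of Nepal and China.",
    "Mount Everest was first summited in 1953.",
]
ranked_passages = [
    "Mount Everest is 8,849 metres tall. " + "Further background text. " * 10,
]

budget = 20
packed_props = pack_context(ranked_propositions, budget)
packed_passages = pack_context(ranked_passages, budget)
```

Within the same budget, several propositions fit while the single long passage does not, which is the mechanism behind the paper's observation that proposition retrieval puts more question-relevant information in front of the reader per token.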
