Emergent Mind

Dense X Retrieval: What Retrieval Granularity Should We Use?

(2312.06648)
Published Dec 11, 2023 in cs.CL, cs.AI, and cs.IR

Abstract

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularities. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval. Moreover, retrieval by proposition also enhances the performance of downstream QA tasks, since the retrieved texts are more condensed with question-relevant information, reducing the need for lengthy input tokens and minimizing the inclusion of extraneous, irrelevant information.

Segmenting/indexing a retrieval corpus by proposition enhances dense retrievers' performance in QA tasks.

Overview

  • Dense retrieval systems are critical in NLP for sourcing information efficiently.

  • The paper introduces propositions as a novel unit for information retrieval, aiming to improve the retrieval process.

  • An empirical study compares the effectiveness of different retrieval granularities, with propositions proving the most effective.

  • Proposition-based retrieval enhances performance in downstream QA tasks by providing denser relevant information.

  • The paper's major contributions include the 'FACTOID WIKI' corpus and a demonstration of the practicality of propositions for dense retrievers.

Introduction

Dense retrieval systems are integral to open-domain NLP applications: they source relevant information by sifting through large corpora. One crucial yet often overlooked design choice is the granularity of the retrieval unit, i.e. whether documents, passages, or sentences are indexed and retrieved. This paper examines how that choice of granularity affects both retrieval quality and downstream task performance.

Propositions as Retrieval Units

While passages and sentences are routinely used as retrieval units, this paper proposes a different approach: using "propositions" as retrieval units. Propositions are defined as atomic expressions within text, each conveying a distinct factoid in a concise, self-contained natural language format. In contrast to indexing longer passages or complex sentences, proposition indexing presents each fact as a standalone unit, which can improve retrieval quality.
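To make the distinction concrete, here is a hand-written sketch of the three granularities for a single invented passage (not taken from FACTOID WIKI; in the paper, propositions are produced by a trained segmentation model, not by hand):

```python
# Hand-written illustration of retrieval-unit granularity. The passage
# is invented for this sketch; the paper generates propositions with a
# trained model rather than manually.
passage = (
    "Marie Curie won the Nobel Prize in Physics in 1903. "
    "She later won a second Nobel Prize, in Chemistry, in 1911."
)

# Sentence-level units: the second sentence is not self-contained,
# since "She" cannot be resolved without the first sentence.
sentences = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "She later won a second Nobel Prize, in Chemistry, in 1911.",
]

# Proposition-level units: atomic factoids with references resolved,
# so each unit can be retrieved and understood out of context.
propositions = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
]

# Every proposition names its subject explicitly.
assert not any(p.startswith("She") for p in propositions)
```

The key property is decontextualization: each proposition carries the context (here, the resolved subject "Marie Curie") needed to interpret it in isolation.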

Empirical Evaluation of Retrieval Granularity

An empirical comparison is drawn among retrieval granularities using a processed version of the English Wikipedia corpus, termed 'FACTOID WIKI.' The corpus is indexed at three levels of granularity: 100-word passages, sentences, and propositions. Six dual-encoder retrievers are evaluated on five open-domain QA datasets. A key finding is that proposition-based retrieval substantially outperforms traditional passage- or sentence-based methods in dense retrieval tasks.
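The indexing-and-retrieval flow can be sketched as follows. This is a minimal illustration only: a unit-normalized bag-of-words encoder stands in for the learned dual encoders the paper evaluates, and the two toy indexes are invented examples, not FACTOID WIKI content:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy encoder: a unit-normalized bag-of-words vector. The paper uses
    # learned dual encoders; this stand-in only illustrates the flow of
    # indexing a corpus at some granularity and ranking units by score.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def score(query_vec, unit_vec):
    # Dot product of unit-normalized vectors, i.e. cosine similarity.
    return sum(v * unit_vec.get(w, 0.0) for w, v in query_vec.items())

def retrieve(query, index, k=1):
    q = embed(query)
    return sorted(index, key=lambda u: score(q, embed(u)), reverse=True)[:k]

# The same (invented) content indexed at two granularities.
passage_index = [
    "The Eiffel Tower is 330 metres tall. It was completed in 1889.",
]
proposition_index = [
    "The Eiffel Tower is 330 metres tall.",
    "The Eiffel Tower was completed in 1889.",
]

best = retrieve("How tall is the Eiffel Tower?", proposition_index)[0]
```

With a proposition-level index, the top-ranked unit contains only the fact the query asks about, whereas the passage-level index can only return the whole mixed passage.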

Downstream Task Performance and Contributions

Propositional retrieval not only improves retrieval metrics but also enhances downstream QA performance. Because propositions are more condensed, they deliver a higher density of question-relevant information, requiring fewer input tokens and admitting less irrelevant content. The paper's main contributions are the proposition as a novel retrieval unit for dense retrieval and the release of 'FACTOID WIKI.' Under the same input-token limit, proposition retrieval generalizes across retrievers and yields higher downstream question-answering accuracy, demonstrating its practicality for efficient information access with dense retrievers.
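The token-budget argument can be sketched as a greedy context packer. This is an assumption-laden illustration: word count stands in for tokens, and the ranked unit lists are invented examples, not the paper's data:

```python
def pack_context(ranked_units, word_budget):
    # Greedily add retrieved units, best-ranked first, until the word
    # budget is exhausted (word count is a rough proxy for tokens;
    # QA readers cap their input length similarly).
    picked, used = [], 0
    for unit in ranked_units:
        n = len(unit.split())
        if used + n > word_budget:
            break
        picked.append(unit)
        used += n
    return picked

# Invented example: under the same budget, short self-contained
# propositions fit more distinct facts than 100-word-scale passages.
ranked_propositions = [
    "Mount Everest is 8,849 metres tall.",
    "Mount Everest lies on the border of Nepal and China.",
    "Mount Everest was first summited in 1953.",
]
ranked_passages = [
    "Mount Everest is 8,849 metres tall. " + "Further background text. " * 10,
]

budget = 20
packed_props = pack_context(ranked_propositions, budget)
packed_passages = pack_context(ranked_passages, budget)
```

Within the same budget, several propositions fit while the single long passage does not, which is the mechanism behind the paper's observation that proposition retrieval puts more question-relevant information in front of the reader per token.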
