Corpus-Steered Query Expansion with Large Language Models

(2402.18031)
Published Feb 28, 2024 in cs.IR and cs.CL

Abstract

Recent studies demonstrate that query expansions generated by LLMs can considerably enhance information retrieval systems by generating hypothetical documents that answer the queries as expansions. However, challenges arise from misalignments between the expansions and the retrieval corpus, resulting in issues like hallucinations and outdated information due to the limited intrinsic knowledge of LLMs. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents. These corpus-originated texts are subsequently used to expand the query together with LLM-knowledge empowered expansions, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.

Overview

  • The paper introduces Corpus-Steered Query Expansion (CSQE), a method that enhances query expansion by combining the strengths of LLMs with the factual correctness of the retrieval corpus.

  • CSQE involves identifying relevant documents and key sentences from a corpus, then enriching query expansions with this corpus-derived data and LLM-generated expansions.

  • Experimental results show that CSQE outperforms both state-of-the-art models and traditional pseudo relevance feedback methods across a variety of datasets, without requiring any training.

  • The method advances information retrieval by mitigating issues such as hallucinations and by improving the factuality and up-to-dateness of the retrieved information.

Enhancing Information Retrieval with Corpus-Steered Query Expansion Using LLMs

Introduction to Corpus-Steered Query Expansion (CSQE)

LLMs have opened a new route to query expansion in information retrieval: generating expansion text that improves the relevance and accuracy of retrieved documents. However, LLM-generated expansions may not align well with the retrieval corpus, which leads to issues like hallucinations and outdated information. To address this, the paper presents Corpus-Steered Query Expansion (CSQE), which combines the relevance-assessment ability of LLMs with the factual correctness and up-to-dateness of the corpus itself. The approach reduces reliance on the intrinsic knowledge of LLMs alone, while still using the LLM to identify pivotal sentences in the corpus and incorporate them into the query expansion.

CSQE Methodology

CSQE works in two steps on the set of initially retrieved documents. First, an LLM identifies which of those documents are relevant to the query. Second, it extracts the key sentences that contribute most to the relevance of those documents. These corpus-derived expansions are then combined with expansions generated from the LLM's intrinsic knowledge to enrich the original query. Grounding the expanded query in both corpus-originated text and LLM-generated text is intended to improve its relevance and factuality compared with expansions that depend solely on LLM knowledge.
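The two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, `toy_relevance_extractor` is a keyword-overlap stand-in for the LLM relevance-assessment call, and joining the pieces by simple concatenation is an assumption about how the expansions are combined.

```python
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive period-based splitter, for illustration only.
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_relevance_extractor(query: str, doc: str) -> List[str]:
    """Hypothetical stand-in for the LLM step: keep sentences that share
    at least one whitespace-delimited term with the query."""
    q_terms = set(query.lower().split())
    return [s for s in split_sentences(doc) if q_terms & set(s.lower().split())]


def csqe_expand(
    query: str,
    retrieved_docs: List[str],
    extract_key_sentences: Callable[[str, str], List[str]],
    generate_llm_expansion: Callable[[str], str],
) -> str:
    # Corpus-derived expansions: key sentences from the initially
    # retrieved documents. Documents judged irrelevant contribute nothing.
    corpus_expansions: List[str] = []
    for doc in retrieved_docs:
        corpus_expansions.extend(extract_key_sentences(query, doc))
    # LLM-knowledge expansion: text generated without looking at the corpus.
    llm_expansion = generate_llm_expansion(query)
    # Assumed combination rule: plain concatenation with the original query.
    return " ".join([query, *corpus_expansions, llm_expansion])


docs = [
    "CSQE extracts pivotal sentences from retrieved documents. It needs no training.",
    "Bananas are rich in potassium.",
]
expanded = csqe_expand(
    "what does CSQE extract",
    docs,
    toy_relevance_extractor,
    lambda q: "A hypothetical LLM-written passage answering: " + q,
)
print(expanded)
```

In a real system, both callables would wrap prompts to an LLM, and `retrieved_docs` would come from a first-pass retriever run on the original query.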

Experimental Verification and Results

CSQE was evaluated on both high-resource web search datasets and low-resource retrieval datasets spanning a variety of domains. Compared against state-of-the-art (SOTA) models and traditional pseudo relevance feedback (PRF) methods, CSQE performed better, which indicates robustness and generalizability across settings. Notably, combining CSQE with a basic BM25 model yielded significant improvements over LLM-knowledge empowered expansions and even surpassed the ContrieverFT model across all evaluated metrics, without any training.
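To make concrete how an expanded query plugs into BM25, here is a small sketch of Okapi BM25 scoring over a toy corpus. The corpus, the queries, the whitespace tokenizer, and the parameter values `k1=0.9` and `b=0.4` are all illustrative assumptions; none of them come from the paper's experimental setup.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 0.9, b: float = 0.4) -> List[float]:
    """Score each document against the query with a basic Okapi BM25.
    Repeated query terms are counted once per occurrence."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    q_tf = Counter(query.lower().split())
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term, q_count in q_tf.items():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += q_count * idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_corpus = [
    "aspirin reduces fever and pain",
    "acetylsalicylic acid inhibits prostaglandin synthesis",
    "bananas are rich in potassium",
]
base = bm25_scores("how does aspirin work", toy_corpus)
expanded = bm25_scores(
    "how does aspirin work acetylsalicylic acid inhibits prostaglandin", toy_corpus
)
print(base)
print(expanded)
```

With the original query, the second document shares no terms and scores zero; once expansion terms are appended, it receives a positive score. This is the general mechanism by which expansion text helps a lexical retriever reach documents that use different wording from the query.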

Future Implications and Prospects

CSQE shows the potential of combining the capabilities of LLMs with corpus-derived data. Its ability to mitigate issues like hallucinations and to incorporate up-to-date information from the corpus opens avenues for further work on hybrid methods that pair the broad knowledge of LLMs with the factual accuracy of specific corpora. Because CSQE adapts to various datasets without any training or domain-specific fine-tuning, it also offers an accessible way to improve query expansion in information retrieval systems.

Conclusion

Corpus-Steered Query Expansion addresses the limitations of LLM-based query expansion by incorporating corpus-originated texts. The approach draws on the broad knowledge of LLMs while keeping the expansions grounded in the factual and relevant content of the corpus, which improves both the effectiveness and the reliability of information retrieval systems. The reported results and the method's adaptability to different domains suggest that CSQE can serve as a versatile tool for search technologies.
