
Self-Retrieval: Building an Information Retrieval System with One Large Language Model

(2403.00801)
Published Feb 23, 2024 in cs.IR, cs.CL, and cs.AI

Abstract

The rise of LLMs has transformed the role of information retrieval (IR) systems in how humans access information. Due to their isolated architecture and limited interaction, existing IR systems cannot fully accommodate the shift from directly providing information to humans to indirectly serving LLMs. In this paper, we propose Self-Retrieval, an end-to-end, LLM-driven information retrieval architecture that fully internalizes the required abilities of an IR system into a single LLM and deeply leverages the capabilities of LLMs throughout the IR process. Specifically, Self-Retrieval internalizes the corpus to be retrieved into an LLM via a natural language indexing architecture. The entire retrieval process is then redefined as a procedure of document generation and self-assessment, which can be executed end-to-end by a single large language model. Experimental results demonstrate that Self-Retrieval not only outperforms previous retrieval approaches by a large margin, but also significantly boosts the performance of LLM-driven downstream applications such as retrieval-augmented generation.

The Self-Retrieval model generates and ranks passages from a corpus using self-supervised learning and self-assessment.

Overview

  • Self-Retrieval proposes a novel architecture where an LLM incorporates information retrieval (IR) functionalities internally, optimizing document indexing, retrieval, and assessment.

  • The architecture enhances retrieval by creating a natural language index and leveraging the LLM's deep semantic understanding for generating relevant documents.

  • Experimental results show that Self-Retrieval outperforms conventional retrieval methods and improves the performance of downstream tasks like retrieval-augmented generation.

  • This approach presents a new direction in integrating LLMs with IR systems, suggesting potential for more accurate and efficient retrievals in various applications.

Self-Retrieval: An End-to-End LLM-driven Information Retrieval Architecture

Introduction

The integration and interaction of large language models (LLMs) with information retrieval (IR) systems have been progressing, yet traditional IR systems lag behind the requirements posed by modern LLMs. This paper introduces Self-Retrieval, a novel architecture that internalizes the functionalities of an IR system entirely within an LLM. This model not only transforms how documents are indexed, retrieved, and assessed, but also harnesses the innate abilities of LLMs for a more integrated retrieval process. The architecture outperforms conventional retrieval methods and demonstrates potential enhancements for downstream applications like retrieval-augmented generation.

Self-Retrieval Architecture

Self-Retrieval is structured around three core processes: corpus indexing, natural language index-driven retrieval, and self-assessment. The approach integrates the document corpus into the LLM via self-supervised learning, creating a natural-language-described index within the model. Upon receiving a query, the model generates relevant documents through a two-step generative process and then assesses the quality of these generated documents with respect to the query. This method leverages the LLM's comprehensive capabilities, including semantic understanding and generation, thereby redefining retrieval as an end-to-end process grounded in deep semantic understanding.
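To make the three-stage flow concrete, the following is a minimal sketch of the pipeline under assumed interfaces: `llm.generate` and `llm.score` are hypothetical stand-ins for a single fine-tuned LLM prompted in different modes, and the prompt formats are illustrative rather than the paper's exact templates.

```python
from dataclasses import dataclass

@dataclass
class RetrievedPassage:
    index_sentence: str   # natural-language index generated from the query
    passage: str          # passage reproduced from the internalized corpus
    score: float          # relevance score from the self-assessment step

def self_retrieve(llm, query: str, k: int = 5) -> list[RetrievedPassage]:
    """Run the three Self-Retrieval stages with a single LLM (sketch)."""
    candidates = []
    for _ in range(k):
        # Step 1: generate a natural-language index (e.g. a title or key
        # sentence) conditioned on the query.
        index_sentence = llm.generate(f"Query: {query}\nIndex:")
        # Step 2: generate the passage itself, constrained so that it matches
        # text in the internalized corpus (see constrained decoding below).
        passage = llm.generate(f"Query: {query}\nIndex: {index_sentence}\nPassage:")
        # Step 3: self-assessment -- the same model scores how well the
        # generated passage answers the query.
        score = llm.score(f"Query: {query}\nPassage: {passage}\nRelevant?")
        candidates.append(RetrievedPassage(index_sentence, passage, score))
    # Rank candidates by the self-assessment score.
    return sorted(candidates, key=lambda c: c.score, reverse=True)
```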

Internalization and Indexing

The initial step internalizes the document corpus into the LLM, enabling it to build implicit index structures described in natural language. This process uses self-supervised learning so the model memorizes both the natural-language indexes and the documents themselves, establishing a basis for retrieval that leverages the LLM's deep understanding capabilities.
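The summary does not spell out the exact training-data format, but a corpus-internalization objective can be sketched as follows: each document is paired with a natural-language index (here, its title or first sentence, as an assumption), and the model is trained to reproduce the passage from that index.

```python
def build_internalization_examples(corpus: list[dict]) -> list[dict]:
    """corpus: list of {"title": str, "text": str} records (assumed schema)."""
    examples = []
    for doc in corpus:
        # Use the title (or the first sentence as a fallback) as the
        # natural-language index for this document.
        index = doc["title"] or doc["text"].split(".")[0]
        examples.append({
            # Self-supervised objective: given the index, generate the passage,
            # so the corpus content is stored in the model's parameters.
            "input": f"Index: {index}\nPassage:",
            "target": doc["text"],
        })
    return examples
```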

Natural Language Index-driven Retrieval

This procedure first generates a natural language index from the input query and then generates the relevant document conditioned on that index. To ensure the generated content aligns with the corpus, the model employs constrained decoding, restricting generation to text that appears in the internalized documents and thereby improving retrieval accuracy.
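One common way to implement such corpus-aligned generation is prefix-constrained decoding over a token trie built from the corpus; the sketch below illustrates this generic pattern and is not necessarily the exact algorithm used in the paper.

```python
def build_trie(tokenized_passages: list[list[int]]) -> dict:
    """Build a nested-dict trie over the token sequences of corpus passages."""
    root: dict = {}
    for tokens in tokenized_passages:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
    return root

def allowed_next_tokens(trie: dict, prefix: list[int]) -> list[int]:
    """Return the token ids that can legally follow the generated prefix."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # prefix has left the corpus; no continuation allowed
        node = node[tok]
    return list(node.keys())

# At each decoding step, the model's next-token distribution is masked so
# that only tokens returned by allowed_next_tokens keep non-zero probability,
# guaranteeing the generated passage exists in the indexed corpus.
```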

Self-Assessment

Self-Retrieval uniquely incorporates a self-assessment phase in which the model evaluates the relevance of each retrieved document to the input query. This phase utilizes pseudo-relevance feedback, allowing the LLM to assign relevance scores to the generated documents and yielding a more targeted and meaningful retrieval output.
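As a hedged illustration, one way such a self-assessment score could be computed is to ask the same model whether the generated passage answers the query and use the probability of a positive judgment as the ranking score; `llm.logprob` is an assumed interface, and the paper's exact scoring formulation may differ.

```python
import math

def self_assess(llm, query: str, passage: str) -> float:
    """Score a generated passage by the model's own relevance judgment (sketch)."""
    prompt = (
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Does the passage answer the query? Answer yes or no:"
    )
    # Hypothetical interface: log-probability of a candidate continuation.
    logp_yes = llm.logprob(prompt, " yes")
    logp_no = llm.logprob(prompt, " no")
    # Normalize the two judgments into a relevance score in (0, 1).
    return math.exp(logp_yes) / (math.exp(logp_yes) + math.exp(logp_no))
```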

Experimental Insights

Experiments on open-domain question answering tasks show that Self-Retrieval significantly outperforms existing sparse, dense, and generative retrieval methods. The model not only demonstrates superior retrieval capabilities but also substantially improves the performance of downstream tasks like retrieval-augmented generation (RAG), suggesting a powerful synergy between retrieval mechanisms and generative model capabilities.
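For illustration, the top-ranked self-retrieved passages can feed directly into a RAG prompt; the sketch below reuses the hypothetical `self_retrieve` and `llm.generate` interfaces from the earlier sketches and is not the paper's API.

```python
def rag_answer(llm, query: str, k: int = 3) -> str:
    """Answer a question with RAG over self-retrieved passages (sketch)."""
    passages = self_retrieve(llm, query, k=k)  # pipeline sketch shown earlier
    context = "\n\n".join(p.passage for p in passages)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer based only on the context above:"
    )
    return llm.generate(prompt)
```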

Theoretical and Practical Implications

Self-Retrieval presents a groundbreaking shift in the interaction between IR systems and LLMs, proposing a unified model that internalizes the retrieval process. Theoretically, it advances our understanding of how LLMs can be leveraged for complex retrieval tasks, integrating storage, semantic understanding, and document generation within a single model. Practically, it sets the groundwork for more sophisticated IR systems that can seamlessly integrate with LLM-driven applications, enhancing both retrieval accuracy and the efficiency of downstream processes.

Future Directions

While Self-Retrieval has demonstrated significant advancements, it also opens new avenues for research, particularly in understanding the scaling laws between document corpus sizes and model parameters, and expanding the model's application across various domains and tasks. Probing further into these areas could elucidate the full potential of LLM-driven IR systems and their impact on information access and knowledge discovery.

Conclusion

The Self-Retrieval architecture heralds a new era in information retrieval, wherein the capacities of LLMs are fully harnessed to perform end-to-end retrieval tasks. By internalizing the entire retrieval process within an LLM, it bridges the gap between traditional IR systems and the adaptive, semantic-rich capabilities of modern LLMs, offering a promising direction for future IR system development.

Limitations and Considerations

The paper also acknowledges the limitations of the current implementation of Self-Retrieval, in particular the need to explore the optimal scaling between the size of the document corpus and the number of model parameters. Moreover, the model's effectiveness across downstream tasks beyond retrieval-augmented generation remains a fruitful area for future research.
