Emergent Mind

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

(2404.16130)
Published Apr 24, 2024 in cs.CL, cs.AI, and cs.IR

Abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables LLMs to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.

Overview

  • Graph RAG introduces a novel method combining LLMs and graph theory to better summarize and answer complex questions across entire text datasets by creating a scalable, graph-based index.

  • The methodology involves three main steps: data indexing, in which text is processed into entities and relationships; community detection and summarization, using algorithms such as the Leiden algorithm; and query handling, which combines responses drawn from the relevant summaries into comprehensive answers.

  • The system was shown to outperform traditional summarization and naive RAG methods in token efficiency, comprehensiveness, and diversity of answers, with promising applications for data analysis in fields like legal review, academic research, and business intelligence.

Rethinking Question Answering over Large Text Corpora with Graph RAG

Overview

A novel enhancement to Retrieval-Augmented Generation (RAG) termed Graph RAG has been introduced, aiming to address the inefficiencies of existing RAG systems in handling global queries that require a broad understanding of entire text datasets. This approach combines LLMs with graph-based indexing to organize entire text corpora into a scalable, graph-based index. The index arranges corpora into a hierarchical structure of interconnected community summaries, enabling comprehensive summarization and the answering of complex, global sensemaking questions with greater depth and breadth than retrieval alone can provide.
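The indexing stage described above can be sketched in a few lines of Python. This is a minimal, hedged illustration: `extract_relations` stands in for the LLM extraction prompt the paper uses (here it is a deterministic stub that pairs co-occurring capitalized tokens), and the chunk size is an arbitrary placeholder, not the paper's tuned value.

```python
from collections import defaultdict

def chunk_text(text, chunk_size=50):
    """Split source text into fixed-size chunks; the paper studies how
    chunk granularity affects extraction quality."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def extract_relations(chunk):
    """Stub for the LLM extraction step: treat co-occurring capitalized
    tokens as related entities."""
    ents = [w.strip(".,") for w in chunk.split() if w[:1].isupper()]
    return list(zip(ents, ents[1:]))

def build_entity_graph(text):
    """Weighted undirected adjacency map; repeated co-mentions of the same
    entity pair increase the edge weight."""
    graph = defaultdict(lambda: defaultdict(int))
    for chunk in chunk_text(text):
        for a, b in extract_relations(chunk):
            if a != b:
                graph[a][b] += 1
                graph[b][a] += 1
    return graph
```

In the real system the extraction stub would be an LLM prompt that also attaches descriptions to each entity and relationship, which later feed the community summaries.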

Graph RAG Methodology

The Graph RAG framework starts with the processing of text into entities and relations, which are then organized into a graph. Community detection algorithms subdivide this graph into closely-knit communities, which can be independently summarized:

  • Data Indexing: Employing an LLM, the input texts are divided into manageable chunks and processed to detect and extract entities and relationships. These elements are then integrated into a knowledge graph.
  • Community Detection and Summarization: Leveraging graph community detection techniques, such as the Leiden algorithm, the graph is partitioned into communities. Each of these communities is then summarized to represent the cumulative knowledge or themes of that subset.
  • Query Handling: When a query is made, each relevant community summary is used to generate a partial answer; these partial answers are then combined in a "reduce" step to form a complete, coherent response.
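The last two steps above can be sketched as follows. This is a hedged illustration, not the paper's implementation: connected components stand in for the Leiden algorithm (which in practice requires a library such as graspologic or leidenalg), and `llm_summarize` and the map-reduce answering are deterministic placeholders for LLM calls.

```python
def detect_communities(graph):
    """Crude stand-in for Leiden: group nodes by connected component.
    graph is an adjacency map {node: {neighbor: weight}}."""
    seen, communities = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comm = [start], set()
        while stack:
            node = stack.pop()
            if node in comm:
                continue
            comm.add(node)
            stack.extend(graph.get(node, {}))
        seen |= comm
        communities.append(comm)
    return communities

def llm_summarize(community):
    """Placeholder for the LLM community-summarization prompt."""
    return "Community covering: " + ", ".join(sorted(community))

def answer_query(query, graph):
    # Map step: one partial answer per community summary.
    partials = [f"[{llm_summarize(c)}] as it relates to {query!r}"
                for c in detect_communities(graph)]
    # Reduce step: combine partials into a single coherent response.
    return "\n".join(partials)
```

Real community detection would yield overlapping, hierarchical partitions rather than flat components, and the reduce step would itself be an LLM call that ranks and merges the partial answers.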

Key Findings

The implementation of Graph RAG was evaluated against traditional text summarization and naive RAG approaches using test corpora of podcast transcripts and news articles. The evaluation focused on the metrics of comprehensiveness, diversity, and empowerment:

  • Improved Performance: Graph RAG consistently outperformed naive RAG across all metrics, with especially large gains in comprehensiveness and diversity, demonstrating its ability to generate more informative and varied responses.
  • Efficiency in Token Usage: The approach was also more efficient, using fewer tokens to achieve more comprehensive summarization, particularly when employing summaries of root-level communities for answering queries.
  • Adaptability: The hierarchical structure of community summaries provided flexibility in answer detail, accommodating different levels of query complexity.
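The adaptability finding above can be made concrete with a small sketch of how a hierarchical index trades detail for token cost. All names and numbers here are illustrative assumptions, not the paper's API or measurements: the root level holds one broad summary, deeper levels hold many finer ones, and answering at the root consults far fewer tokens.

```python
# Hypothetical two-level index: level 0 is the root community,
# level 1 its sub-communities (structure assumed for illustration).
hierarchy = {
    0: ["broad summary of the whole news cluster"],
    1: ["summary: AI-policy stories", "summary: chip-industry stories"],
}

def summaries_for_level(hierarchy, level):
    """Pick the summaries at the requested level, falling back to the
    deepest available level if the request is too deep."""
    return hierarchy.get(level, hierarchy[max(hierarchy)])

def token_budget(summaries, tokens_per_summary=600):
    """Rough context-window cost of answering at this level
    (600 tokens/summary is an assumed figure, not a measurement)."""
    return len(summaries) * tokens_per_summary
```

A query router could then choose the level whose `token_budget` fits the available context window, matching the paper's observation that root-level summaries answer broad questions most cheaply.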

Implications and Future Work

The Graph RAG model holds substantial implications for fields requiring deep analysis of large text datasets such as legal document review, academic research, and business intelligence. It allows users to interactively explore data by asking increasingly specific questions, effectively "drilling down" into topics of interest. This could substantially alter workflows in data-heavy domains by providing quicker, context-aware insights.

Potential enhancements for Graph RAG could include exploring more effective node and edge extraction techniques to refine graph quality, or integrating multi-modal data sources into the graph structure. Further research might also investigate the effects of varying the granularity of text chunks and graph communities on the quality of generated answers.

Conclusion

Graph RAG represents a significant advancement in utilizing LLMs for data sensemaking across vast text corpora. By effectively merging the capabilities of knowledge graphs and the generative power of LLMs, it offers a robust framework for answering extensive, complex queries that demand a holistic view of the underlying data. This research thus sets the stage for further explorations into enhancing the capacity and efficiency of information retrieval in large-scale data environments.
