STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

(2404.13207)
Published Apr 19, 2024 in cs.IR and cs.LG

Abstract

Answering real-world complex queries, such as complex product search, often requires accurate retrieval from semi-structured knowledge bases that involve a blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, previous works have mostly studied textual and relational retrieval tasks as separate topics. To address the gap, we develop STARK, a large-scale Semi-structured retrieval benchmark on Textual and Relational Knowledge Bases. Our benchmark covers three domains/datasets: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties, together with their ground-truth answers (items). We conduct rigorous human evaluation to validate the quality of our synthesized queries. We further enhance the benchmark with high-quality human-generated queries to provide an authentic reference. STARK serves as a comprehensive testbed for evaluating the performance of retrieval systems driven by LLMs. Our experiments suggest that STARK presents significant challenges to current retrieval and LLM systems, indicating the demand for building more capable retrieval systems. The benchmark data and code are available at https://github.com/snap-stanford/stark.

Overview

  • The paper introduces the STaRK benchmark, designed to evaluate retrieval systems in semi-structured knowledge bases that combine relational and textual data.

  • STaRK includes three sub-modules tailored to different domains: e-commerce, academic publications, and medical queries, each presenting unique challenges and scenarios.

  • The benchmark uses a multi-step generation process that involves sampling relational requirements, extracting textual properties, combining information, and constructing ground truth nodes for accurate assessments.

  • Experimental results show that LLM-based reranking methods significantly improve retrieval performance, although challenges remain in handling complex queries involving deep relational reasoning.

Exploring LLMs in Retrieval Systems for Semi-structured Knowledge Bases with the STaRK Benchmark

Introduction to Semi-structured Knowledge Bases (SKBs)

SKBs combine relational and textual data and are crucial in domains like e-commerce, biomedical informatics, and academic research. Retrieval from SKBs poses unique challenges due to the compound nature of user queries and the complexity of the underlying data structure. Existing benchmarks mostly focus on either structured or unstructured data alone, sidelining the nuanced requirements of SKB-based retrieval tasks.
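
To make the data model concrete, below is a minimal sketch of what an SKB looks like in code. The `Entity` and `SKB` classes are purely illustrative (they are not the STaRK API), but they capture the two access patterns a retriever must combine: following typed relations and reading free text.

```python
# A minimal, illustrative data model for an SKB: each entity carries an
# unstructured text description plus typed relations to other entities.
# The Entity/SKB names are hypothetical, not the STaRK API.
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: int
    entity_type: str                                # e.g. "product", "paper", "drug"
    text: str                                       # unstructured description
    relations: dict = field(default_factory=dict)   # relation name -> list of entity ids

class SKB:
    def __init__(self, entities):
        self.entities = {e.entity_id: e for e in entities}

    def neighbors(self, entity_id, relation):
        # Relational access: follow one typed edge from an entity.
        ids = self.entities[entity_id].relations.get(relation, [])
        return [self.entities[i] for i in ids]

    def description(self, entity_id):
        # Textual access: the free text attached to an entity.
        return self.entities[entity_id].text
```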

STaRK Benchmark: A Novel Approach

The paper introduces "STaRK" (Semi-structured retrieval benchmark on Textual and Relational Knowledge Bases), designed to assess how accurately retrieval systems can process and extract relevant information from SKBs. STaRK comprises three sub-modules, each reflecting a different real-world scenario: STaRK-Amazon for e-commerce, STaRK-MAG for academic publications, and STaRK-Prime for precision-medicine queries.

Key Features Include:

  • Natural-sounding queries that reflect realistic user interactions.
  • Queries that require context-specific reasoning tied to distinct user intentions.
  • Coverage across varied domains, ensuring a comprehensive evaluation.

The benchmark's synthesized queries are assessed along several key dimensions, including naturalness, diversity, and practical relevance, enabling a granular evaluation of retrieval systems.

Benchmark Generation and Methodology

The construction of STaRK involves a multi-step process (sketched in code after the list):

  1. Sampling Relational Requirements: Queries are initially structured around specific relational templates reflective of real-world data querying.
  2. Extracting Textual Properties: Textually rich descriptions or annotations related to the sampled entities provide the textual layer.
  3. Combining Information: The relational and textual components are coalesced to generate queries that emulate realistic search scenarios.
  4. Constructing Ground Truth Nodes: Multi-model evaluations ensure that only the most relevant entities are retained as correct answers, refining the benchmark's accuracy.
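
Reusing the illustrative `SKB` class from above, the sketch below renders these four steps schematically; the prompts, the `llm` callable, and the verification step are simplified placeholders for the paper's templates and multi-model checks, not the released pipeline.

```python
import random

def synthesize_query(skb, llm, relation="also_bought"):
    # 1. Sample a relational requirement: an anchor entity plus a typed
    #    relation defines the constraint "connected to <anchor> via <relation>".
    anchor_id = random.choice(list(skb.entities))
    candidates = skb.neighbors(anchor_id, relation)
    if not candidates:
        return None

    # 2. Extract a textual property from one candidate's description.
    target = random.choice(candidates)
    prop = llm(f"Extract one salient property from this description:\n{target.text}")

    # 3. Combine the relational and textual components into a natural query.
    query = llm(
        f"Write a natural user query for an item that is '{relation}' of "
        f"\"{skb.description(anchor_id)[:80]}\" and that {prop}."
    )

    # 4. Construct ground-truth nodes: keep only candidates that a verifier
    #    (multiple models in the paper) agrees also satisfy the textual property.
    answers = [
        c.entity_id for c in candidates
        if llm(f"Does this item satisfy '{prop}'? Answer Yes/No.\n{c.text}").startswith("Yes")
    ]
    return query, answers
```

Step 4 is what keeps the ground truth reliable: candidates that satisfy the relational constraint but fail the textual check are filtered out before being accepted as answers.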

Insights from Data Distribution and Human Evaluation

Analyses show varied distributions of query and answer length across the datasets, suggesting diversity in complexity and information depth. Human evaluation underscores the benchmark's realism and relevance, with high positive ratings for the naturalness, practicality, and diversity of queries.

Experimental Findings and Future Directions

Current models, including both classical retrieval systems and those using modern LLM enhancements, show varied efficacy across the STaRK datasets. In particular, techniques that apply LLM-based reranking after an initial retrieval stage significantly outperform other methods, demonstrating LLMs' ability to improve contextual understanding and relevance in responses.
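
As a rough illustration of that retrieve-then-rerank pattern, the sketch below assumes a generic `embed` function and an `llm_score` callable that rates query-document relevance; both are placeholders rather than the paper's exact models or prompts.

```python
import numpy as np

def retrieve_then_rerank(query, skb, embed, llm_score, k=20, top_n=5):
    ids = list(skb.entities)

    # First stage: dense retrieval by cosine similarity between the query
    # embedding and each entity's textual description.
    doc_vecs = np.stack([np.asarray(embed(skb.description(i))) for i in ids])
    q_vec = np.asarray(embed(query))
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    top_k = [ids[j] for j in np.argsort(-sims)[:k]]

    # Second stage: an LLM scores each of the k candidates against the query,
    # and the candidates are re-ordered by that score.
    scored = [(i, llm_score(query, skb.description(i))) for i in top_k]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:top_n]]
```

Restricting LLM scoring to the top-k candidates keeps the number of model calls bounded, which is also where the latency concern noted below comes from.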

However, challenges persist, especially in cases requiring deep relational reasoning or handling queries with complex interdependencies between textual and relational data. Future advancements could focus on integrating more sophisticated reasoning capabilities and reducing latency in LLM-integrated retrieval systems.

Conclusion

The introduction of STaRK marks a significant step towards understanding and improving retrieval systems for SKBs. By providing a robust framework and a comprehensive set of metrics, STaRK allows for a detailed analysis of current LLM-based retrieval models and paves the way for future innovations in this critical area of research.
