STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

(2404.13207)
Published Apr 19, 2024 in cs.IR and cs.LG

Abstract

Answering real-world complex queries, such as complex product search, often requires accurate retrieval from semi-structured knowledge bases that involve a blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, previous works have mostly studied textual and relational retrieval tasks as separate topics. To address the gap, we develop STARK, a large-scale Semi-structured retrieval benchmark on Textual and Relational Knowledge Bases. Our benchmark covers three domains/datasets: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties, together with their ground-truth answers (items). We conduct rigorous human evaluation to validate the quality of our synthesized queries. We further enhance the benchmark with high-quality human-generated queries to provide an authentic reference. STARK serves as a comprehensive testbed for evaluating the performance of retrieval systems driven by LLMs. Our experiments suggest that STARK presents significant challenges to current retrieval and LLM systems, indicating the demand for building more capable retrieval systems. The benchmark data and code are available at https://github.com/snap-stanford/stark.

Overview

  • The paper introduces the STaRK benchmark, designed to evaluate retrieval systems in semi-structured knowledge bases that combine relational and textual data.

  • STaRK includes three sub-modules tailored to different domains: e-commerce, academic publications, and medical queries, each presenting unique challenges and scenarios.

  • The benchmark uses a multi-step generation process that involves sampling relational requirements, extracting textual properties, combining information, and constructing ground truth nodes for accurate assessments.

  • Experimental results show that LLM-based reranking methods significantly improve retrieval performance, although challenges remain in handling complex queries involving deep relational reasoning.

Exploring LLMs in Retrieval Systems for Semi-structured Knowledge Bases with the STaRK Benchmark

Introduction to Semi-structured Knowledge Bases (SKBs)

SKBs combine relational and textual data and are crucial in domains like e-commerce, biomedical informatics, and academic research. Retrieval from SKBs poses unique challenges due to the compound nature of user queries and the complexity of the underlying data structure. Existing benchmarks mostly focus on either structured or unstructured data alone, sidelining the nuanced requirements of SKB-based retrieval tasks.
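
To make the data model concrete, below is a minimal sketch of what an SKB looks like in code. The `Entity` and `SKB` classes are purely illustrative (they are not the STaRK API), but they capture the two access patterns a retriever must combine: following typed relations and reading free text.

```python
# A minimal, illustrative data model for an SKB: each entity carries an
# unstructured text description plus typed relations to other entities.
# The Entity/SKB names are hypothetical, not the STaRK API.
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: int
    entity_type: str                                # e.g. "product", "paper", "drug"
    text: str                                       # unstructured description
    relations: dict = field(default_factory=dict)   # relation name -> list of entity ids

class SKB:
    def __init__(self, entities):
        self.entities = {e.entity_id: e for e in entities}

    def neighbors(self, entity_id, relation):
        # Relational access: follow one typed edge from an entity.
        ids = self.entities[entity_id].relations.get(relation, [])
        return [self.entities[i] for i in ids]

    def description(self, entity_id):
        # Textual access: the free text attached to an entity.
        return self.entities[entity_id].text
```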

STaRK Benchmark: A Novel Approach

The paper introduces "STaRK" (Semi-structured retrieval benchmark on Textual and Relational Knowledge Bases), designed to assess how accurately retrieval systems can process and extract relevant information from SKBs. STaRK comprises three sub-modules, each reflecting a different real-world scenario: STaRK-Amazon for e-commerce, STaRK-MAG for academic publications, and STaRK-Prime for precision-medicine queries.

Key Features Include:

  • Natural-sounding queries that reflect realistic user interactions.
  • Queries that require context-specific reasoning tied to distinct user intentions.
  • Coverage across varied domains, ensuring a comprehensive evaluation.

The benchmark's synthesized queries are assessed along several key dimensions, including naturalness, diversity, and practical relevance, enabling a granular evaluation of retrieval systems.

Benchmark Generation and Methodology

The construction of STaRK involves a multi-step process (sketched in code after the list):

  1. Sampling Relational Requirements: Queries are initially structured around specific relational templates reflective of real-world data querying.
  2. Extracting Textual Properties: Textually rich descriptions or annotations related to the sampled entities provide the textual layer.
  3. Combining Information: The relational and textual components are coalesced to generate queries that emulate realistic search scenarios.
  4. Constructing Ground Truth Nodes: Multi-model evaluations ensure that only the most relevant entities are retained as correct answers, refining the benchmark's accuracy.
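
Reusing the illustrative `SKB` class from above, the sketch below renders these four steps schematically; the prompts, the `llm` callable, and the verification step are simplified placeholders for the paper's templates and multi-model checks, not the released pipeline.

```python
import random

def synthesize_query(skb, llm, relation="also_bought"):
    # 1. Sample a relational requirement: an anchor entity plus a typed
    #    relation defines the constraint "connected to <anchor> via <relation>".
    anchor_id = random.choice(list(skb.entities))
    candidates = skb.neighbors(anchor_id, relation)
    if not candidates:
        return None

    # 2. Extract a textual property from one candidate's description.
    target = random.choice(candidates)
    prop = llm(f"Extract one salient property from this description:\n{target.text}")

    # 3. Combine the relational and textual components into a natural query.
    query = llm(
        f"Write a natural user query for an item that is '{relation}' of "
        f"\"{skb.description(anchor_id)[:80]}\" and that {prop}."
    )

    # 4. Construct ground-truth nodes: keep only candidates that a verifier
    #    (multiple models in the paper) agrees also satisfy the textual property.
    answers = [
        c.entity_id for c in candidates
        if llm(f"Does this item satisfy '{prop}'? Answer Yes/No.\n{c.text}").startswith("Yes")
    ]
    return query, answers
```

Step 4 is what keeps the ground truth reliable: candidates that satisfy the relational constraint but fail the textual check are filtered out before being accepted as answers.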

Insights from Data Distribution and Human Evaluation

Analyses show varied distributions of query and answer length across the datasets, suggesting diversity in complexity and information depth. Human evaluation underscores the benchmark's realism and relevance, with high positive ratings for the naturalness, practicality, and diversity of queries.

Experimental Findings and Future Directions

Current models, including both classical retrieval systems and those using modern LLM enhancements, show varied efficacy across the STaRK datasets. In particular, techniques that apply LLM-based reranking after an initial retrieval stage significantly outperform other methods, demonstrating LLMs' ability to improve contextual understanding and relevance in responses.
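
As a rough illustration of that retrieve-then-rerank pattern, the sketch below assumes a generic `embed` function and an `llm_score` callable that rates query-document relevance; both are placeholders rather than the paper's exact models or prompts.

```python
import numpy as np

def retrieve_then_rerank(query, skb, embed, llm_score, k=20, top_n=5):
    ids = list(skb.entities)

    # First stage: dense retrieval by cosine similarity between the query
    # embedding and each entity's textual description.
    doc_vecs = np.stack([np.asarray(embed(skb.description(i))) for i in ids])
    q_vec = np.asarray(embed(query))
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    top_k = [ids[j] for j in np.argsort(-sims)[:k]]

    # Second stage: an LLM scores each of the k candidates against the query,
    # and the candidates are re-ordered by that score.
    scored = [(i, llm_score(query, skb.description(i))) for i in top_k]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:top_n]]
```

Restricting LLM scoring to the top-k candidates keeps the number of model calls bounded, which is also where the latency concern noted below comes from.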

However, challenges persist, especially in cases requiring deep relational reasoning or handling queries with complex interdependencies between textual and relational data. Future advancements could focus on integrating more sophisticated reasoning capabilities and reducing latency in LLM-integrated retrieval systems.

Conclusion

The introduction of STaRK marks a significant step towards understanding and improving retrieval systems for SKBs. By providing a robust framework and a comprehensive set of metrics, STaRK allows for a detailed analysis of current LLM-based retrieval models and paves the way for future innovations in this critical area of research.
