BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

(2407.12883)
Published Jul 16, 2024 in cs.CL, cs.AI, and cs.IR

Abstract

Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398 real-world queries collected from diverse domains (such as economics, psychology, robotics, software engineering, earth sciences, etc.), sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38], which achieves an nDCG@10 score of 59.0 there, produces an nDCG@10 of only 18.0 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by LLMs improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models, as we validate by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at https://brightbenchmark.github.io.

Figure: keyword-based and semantic-based retrieval levels in existing benchmarks (e.g., the NQ and MS MARCO datasets).

Overview

  • The paper introduces BRIGHT, a benchmark designed for evaluating retrieval systems on reasoning-intensive tasks, departing from traditional keyword and semantic matching benchmarks.

  • BRIGHT comprises a dataset with 1,398 real-world queries from diverse domains and includes human-curated data from StackExchange and other sources, emphasizing complex, reasoning-intensive scenarios.

  • Evaluation of 13 retrieval models shows even state-of-the-art models struggle with this benchmark, encouraging the development of more sophisticated retrieval algorithms incorporating multifaceted reasoning.

A Formal Analysis of "BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval"

The paper presents a comprehensive benchmark, BRIGHT, specifically designed for evaluating retrieval systems on tasks that necessitate intensive reasoning. This marks a shift from the traditional focus on keyword and semantic matching in existing benchmarks such as Natural Questions, MS MARCO, and BEIR. The paper details a methodology for constructing the benchmark from naturally occurring data across varied domains, reports an extensive evaluation of state-of-the-art models, and demonstrates new paradigms for augmenting retrieval queries with reasoning steps generated by LLMs.

Methodology and Dataset Construction

BRIGHT comprises 1,398 real-world queries spanning diverse domains such as economics, psychology, robotics, software engineering, and earth sciences. The dataset is derived from human-curated and naturally occurring data. Notably, seven datasets were compiled from StackExchange, where real user questions were paired with web pages linked from credible answers. Additionally, a coding task involving the rare Pony programming language and other datasets for retrieving STEM theorems or examples were incorporated.
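
For readers who want to experiment with the data, a minimal loading sketch follows. The Hugging Face repository ID, configuration names, split names, and field names are assumptions based on common conventions; the project page (https://brightbenchmark.github.io) documents the authoritative layout.

```python
from datasets import load_dataset  # pip install datasets

# Both the repository ID and the configuration/split names below are assumptions.
examples = load_dataset("xlangai/BRIGHT", "examples", split="economics")
documents = load_dataset("xlangai/BRIGHT", "documents", split="economics")

print(examples[0]["query"])                  # assumed field name
print(len(documents), "candidate documents")
```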

Three essential components characterize the dataset construction:

  1. Selection of Queries: The queries predominantly represent complex, reasoning-intensive real-world scenarios.
  2. Document Collection: Relevant documents were manually annotated, with detailed reasoning steps recorded to justify their relevance to each query; long web pages were split into passages to support fine-grained relevance judgments (a passage-splitting sketch follows this list).
  3. Hard Negatives: To simulate real-world retrieval challenges, hard-negative documents were gathered by searching Google with query keywords and then human-verified to be superficially similar to, but not actually relevant to, the queries.
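
As a rough illustration of the passage-splitting step in item 2, the sketch below chunks a long web page into fixed-size word windows. The window size is an illustrative assumption, not the paper's actual setting.

```python
def split_into_passages(text: str, max_words: int = 200) -> list[str]:
    """Split a document into consecutive word-window passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Example: a scraped web page linked from a trusted StackExchange answer.
page = "Supply and demand determine market prices. " * 200
passages = split_into_passages(page)
print(len(passages), "passages;", len(passages[0].split()), "words in the first")
```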

Evaluation and Results

The authors evaluated 13 models on the benchmark, including traditional bag-of-words retrievers like BM25, multiple open-source dense retrieval models, and proprietary models. The results demonstrate that even the best-performing model, Qwen, achieves an nDCG@10 score of only 22.1, underscoring the difficulty posed by reasoning-intensive retrieval tasks.
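
For reference, the sketch below shows how nDCG@10, the metric quoted throughout, is commonly computed. It is a generic implementation rather than the paper's evaluation code, and it assumes binary relevance labels (under which the two common gain definitions, rel and 2^rel - 1, coincide).

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids: list[str], gold_ids: set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance: gain 1 if a retrieved id is a gold document."""
    gains = [1.0 if doc_id in gold_ids else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(gold_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Example: two gold passages, one retrieved at rank 1 and one missed.
print(ndcg_at_k(["p7", "p3", "p9"], {"p7", "p42"}, k=10))  # ≈ 0.61
```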

Key findings include:

  • Model Performance: Traditional sparse models like BM25 performed significantly worse than modern dense retrieval models. However, the dense models also struggled, underscoring that keyword matching and simple semantic similarity are insufficient for these tasks.
  • LLM Query Augmentation: Forming retrieval queries from reasoning steps generated by LLMs, specifically Llama-3-70B and GPT-4, significantly improved performance, though the best scores remained underwhelming (below 30 nDCG@10); a sketch of this approach follows the list.
  • Reranking with LLMs: Using LLMs like Gemini and GPT-4 for reranking improved performance by up to 3.1 points, showing that while LLMs can scrutinize document relevance effectively, they are not a panacea for the intrinsic retrieval challenges.
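
The sketch below illustrates the query-augmentation idea from the second finding above: have an LLM reason about the query first, then retrieve using the reasoning text appended to the query. It is a minimal BM25-based approximation, not the authors' pipeline, and reason_about is a hypothetical stand-in for the actual LLM call.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def reason_about(query: str) -> str:
    # Placeholder for an LLM call (e.g. GPT-4 or Llama-3-70B) prompted to reason
    # step by step about what kind of document would answer the query.
    return ""

def retrieve(query: str, corpus: list[str], k: int = 10, augment: bool = True) -> list[int]:
    """Rank passages with BM25, optionally appending LLM reasoning to the query."""
    text = f"{query} {reason_about(query)}" if augment else query
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(text.lower().split())
    return sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]
```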

Robustness and Long-context Retrieval

The authors also examined the robustness of the benchmark against data leakage from pretraining and explored retrieval over long documents. Continued training on the benchmark documents did not yield significant improvements, indicating that BRIGHT's difficulty is not an artifact of data leakage. In the long-document setting, retrieval remained challenging despite the reduced search space, with the best model achieving a recall@1 of only 27.8.
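
For completeness, a minimal recall@1 implementation is sketched below, assuming at least one gold document per query; it is generic code, not the authors' evaluation script.

```python
def recall_at_k(rankings: dict[str, list[str]], gold: dict[str, set[str]], k: int = 1) -> float:
    """Fraction of queries with at least one gold document among the top-k results."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] & set(ranked[:k]))
    return hits / len(rankings)

# Two queries: the first misses at rank 1, the second hits.
print(recall_at_k({"q1": ["d3", "d1"], "q2": ["d9"]}, {"q1": {"d1"}, "q2": {"d9"}}))  # 0.5
```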

Implications and Future Work

The results suggest that the current state-of-the-art retrieval systems lack the requisite reasoning capabilities for handling complex, real-world queries effectively. This calls for novel retrieval methodologies that can incorporate multifaceted reasoning into the retrieval process. The benchmark sets a higher bar for future retrieval models and methodologies, encouraging researchers to develop more sophisticated retrieval algorithms that can understand and process nuanced and contextual information effectively.

Looking forward, several avenues appear promising:

  • Retrieval-Augmented Generation (RAG): The potential to improve RAG models by integrating relevant documents into the context for more accurate and coherent responses.
  • Multi-modal Retrieval: Incorporating data beyond text, such as images or other data types, can provide a richer retrieval model that can handle a wider range of queries.
  • Adaptive and Interactive Retrieval Systems: Developing models that interactively refine search criteria based on user feedback, closely mimicking real-world information-seeking behavior.

In conclusion, BRIGHT is a substantial contribution to the retrieval community, pushing the frontier towards more realistic and challenging benchmarks. It lays the groundwork for developing the next generation of retrieval models, capable of deep reasoning and better aligned with human information retrieval needs. The release of this benchmark is likely to inspire further research and innovation in the domain of retrieval systems.
