BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

(2407.12883)
Published Jul 16, 2024 in cs.CL, cs.AI, and cs.IR

Abstract

Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398 real-world queries collected from diverse domains (such as economics, psychology, robotics, software engineering, earth sciences, etc.), sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38], which achieves an nDCG@10 score of 59.0 there, produces an nDCG@10 of only 18.0 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by LLMs improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models, as we validate by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at https://brightbenchmark.github.io.

Figure: keyword-based and semantic-based retrieval levels in existing benchmarks (e.g., the NQ and MS MARCO datasets).

Overview

  • The paper introduces BRIGHT, a benchmark designed for evaluating retrieval systems on reasoning-intensive tasks, departing from traditional keyword and semantic matching benchmarks.

  • BRIGHT comprises a dataset with 1,398 real-world queries from diverse domains and includes human-curated data from StackExchange and other sources, emphasizing complex, reasoning-intensive scenarios.

  • Evaluation of 13 retrieval models shows even state-of-the-art models struggle with this benchmark, encouraging the development of more sophisticated retrieval algorithms incorporating multifaceted reasoning.

A Formal Analysis of "BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval"

The paper presents a comprehensive benchmark, BRIGHT, specifically designed for evaluating retrieval systems on tasks that necessitate intensive reasoning. This marks a shift from the traditional focus on keyword and semantic matching in existing benchmarks such as Natural Questions, MS MARCO, and BEIR. The paper details a methodology for constructing the benchmark from naturally occurring data across varied domains, reports an extensive evaluation of state-of-the-art models, and demonstrates new paradigms for augmenting retrieval queries with reasoning steps generated by LLMs.

Methodology and Dataset Construction

BRIGHT comprises 1,398 real-world queries spanning diverse domains such as economics, psychology, robotics, software engineering, and earth sciences. The dataset is derived from human-curated and naturally occurring data. Notably, seven datasets were compiled from StackExchange, where real user questions were paired with web pages linked from credible answers. Additionally, a coding task involving the rare Pony programming language and other datasets for retrieving STEM theorems or examples were incorporated.
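
For readers who want to experiment with the data, a minimal loading sketch follows. The Hugging Face repository ID, configuration names, split names, and field names are assumptions based on common conventions; the project page (https://brightbenchmark.github.io) documents the authoritative layout.

```python
from datasets import load_dataset  # pip install datasets

# Both the repository ID and the configuration/split names below are assumptions.
examples = load_dataset("xlangai/BRIGHT", "examples", split="economics")
documents = load_dataset("xlangai/BRIGHT", "documents", split="economics")

print(examples[0]["query"])                  # assumed field name
print(len(documents), "candidate documents")
```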

Three essential components characterize the dataset construction:

  1. Selection of Queries: The queries predominantly represent complex, reasoning-intensive real-world scenarios.
  2. Document Collection: Relevant documents were manually annotated, with detailed reasoning steps recorded to justify their relevance to each query; long web pages were split into passages to support fine-grained relevance judgments (a passage-splitting sketch follows this list).
  3. Hard Negatives: To simulate real-world retrieval challenges, hard-negative documents were gathered by searching Google with query keywords and then human-verified to be superficially similar to, but not actually relevant to, the queries.
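
As a rough illustration of the passage-splitting step in item 2, the sketch below chunks a long web page into fixed-size word windows. The window size is an illustrative assumption, not the paper's actual setting.

```python
def split_into_passages(text: str, max_words: int = 200) -> list[str]:
    """Split a document into consecutive word-window passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Example: a scraped web page linked from a trusted StackExchange answer.
page = "Supply and demand determine market prices. " * 200
passages = split_into_passages(page)
print(len(passages), "passages;", len(passages[0].split()), "words in the first")
```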

Evaluation and Results

The authors evaluated 13 models on the benchmark, including traditional bag-of-words retrievers like BM25, multiple open-source dense retrieval models, and proprietary models. The results demonstrate that even the best-performing model, Qwen, achieves an nDCG@10 score of only 22.1, underscoring the difficulty posed by reasoning-intensive retrieval tasks.
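
For reference, the sketch below shows how nDCG@10, the metric quoted throughout, is commonly computed. It is a generic implementation rather than the paper's evaluation code, and it assumes binary relevance labels (under which the two common gain definitions, rel and 2^rel - 1, coincide).

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids: list[str], gold_ids: set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance: gain 1 if a retrieved id is a gold document."""
    gains = [1.0 if doc_id in gold_ids else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(gold_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Example: two gold passages, one retrieved at rank 1 and one missed.
print(ndcg_at_k(["p7", "p3", "p9"], {"p7", "p42"}, k=10))  # ≈ 0.61
```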

Key findings include:

  • Model Performance: Traditional sparse models like BM25 performed significantly worse than modern dense retrieval models. However, the dense models also struggled, underscoring that keyword matching and simple semantic similarity are insufficient for these tasks.
  • LLM Query Augmentation: Forming retrieval queries from reasoning steps generated by LLMs, specifically Llama-3-70B and GPT-4, significantly improved performance, though the best scores remained underwhelming (below 30 nDCG@10); a sketch of this approach follows the list.
  • Reranking with LLMs: Using LLMs like Gemini and GPT-4 for reranking improved performance by up to 3.1 points, showing that while LLMs can scrutinize document relevance effectively, they are not a panacea for the intrinsic retrieval challenges.
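
The sketch below illustrates the query-augmentation idea from the second finding above: have an LLM reason about the query first, then retrieve using the reasoning text appended to the query. It is a minimal BM25-based approximation, not the authors' pipeline, and reason_about is a hypothetical stand-in for the actual LLM call.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def reason_about(query: str) -> str:
    # Placeholder for an LLM call (e.g. GPT-4 or Llama-3-70B) prompted to reason
    # step by step about what kind of document would answer the query.
    return ""

def retrieve(query: str, corpus: list[str], k: int = 10, augment: bool = True) -> list[int]:
    """Rank passages with BM25, optionally appending LLM reasoning to the query."""
    text = f"{query} {reason_about(query)}" if augment else query
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(text.lower().split())
    return sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]
```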

Robustness and Long-context Retrieval

The authors also examined the robustness of the benchmark against data leakage from pretraining and explored retrieval over long documents. Continued training on the benchmark documents did not yield significant improvements, indicating that BRIGHT's difficulty is not an artifact of data leakage. In the long-document setting, retrieval remained challenging despite the reduced search space, with the best model achieving a recall@1 of only 27.8.
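
For completeness, a minimal recall@1 implementation is sketched below, assuming at least one gold document per query; it is generic code, not the authors' evaluation script.

```python
def recall_at_k(rankings: dict[str, list[str]], gold: dict[str, set[str]], k: int = 1) -> float:
    """Fraction of queries with at least one gold document among the top-k results."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] & set(ranked[:k]))
    return hits / len(rankings)

# Two queries: the first misses at rank 1, the second hits.
print(recall_at_k({"q1": ["d3", "d1"], "q2": ["d9"]}, {"q1": {"d1"}, "q2": {"d9"}}))  # 0.5
```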

Implications and Future Work

The results suggest that the current state-of-the-art retrieval systems lack the requisite reasoning capabilities for handling complex, real-world queries effectively. This calls for novel retrieval methodologies that can incorporate multifaceted reasoning into the retrieval process. The benchmark sets a higher bar for future retrieval models and methodologies, encouraging researchers to develop more sophisticated retrieval algorithms that can understand and process nuanced and contextual information effectively.

Looking forward, several avenues appear promising:

  • Retrieval-Augmented Generation (RAG): The potential to improve RAG models by integrating relevant documents into the context for more accurate and coherent responses.
  • Multi-modal Retrieval: Incorporating data beyond text, such as images or other data types, can provide a richer retrieval model that can handle a wider range of queries.
  • Adaptive and Interactive Retrieval Systems: Developing models that interactively refine search criteria based on user feedback, closely mimicking real-world information-seeking behavior.

In conclusion, BRIGHT is a substantial contribution to the retrieval community, pushing the frontier towards more realistic and challenging benchmarks. It lays the groundwork for developing the next generation of retrieval models, capable of deep reasoning and better aligned with human information retrieval needs. The release of this benchmark is likely to inspire further research and innovation in the domain of retrieval systems.
