
CRAG -- Comprehensive RAG Benchmark

(2406.04744)
Published Jun 7, 2024 in cs.CL

Abstract

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs) lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve ≤34% accuracy on CRAG, adding RAG in a straightforward manner improves accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG and general QA solutions.

Figure: QA using LLMs without vs. with Retrieval-Augmented Generation (RAG).

Overview

  • The paper introduces the Comprehensive RAG Benchmark (CRAG), designed to address the limitations of existing datasets for evaluating retrieval-augmented generation (RAG) systems, which LLMs rely on to mitigate hallucinations and knowledge deficiencies.

  • CRAG's dataset comprises 4,409 question-answer pairs across five domains, featuring diverse question categories and simulating real-world searches through mock APIs, including up to 50 HTML pages per question and Knowledge Graphs containing 2.6 million entities.

  • The benchmarking reveals the current state of RAG systems: straightforward RAG lifts accuracy from at most 34% (LLM-only) to 44%, while even state-of-the-art industry solutions hallucinate on roughly 17% of questions and struggle with dynamic facts and low-popularity entities, leaving substantial room for improvement.

Comprehensive RAG Benchmark (CRAG)

Retrieval-Augmented Generation (RAG) systems have emerged as a prominent approach to mitigating knowledge deficiencies in LLMs. The paper "CRAG -- Comprehensive RAG Benchmark" by Yang et al. introduces the Comprehensive RAG Benchmark (CRAG) to address the inadequacies of existing RAG datasets. The benchmark is designed to capture the diverse and dynamic nature of real-world Question Answering (QA) tasks, with particular attention to the hallucination issues that remain a key limitation of current LLMs.

Dataset Composition and Characteristics

CRAG incorporates a robust dataset of 4,409 question-answer pairs spanning five domains: Finance, Sports, Music, Movies, and Open domain. The dataset covers eight distinct question categories: Simple, Simple with Condition, Comparison, Aggregation, Multi-hop, Set, Post-processing-heavy, and False-premise questions. This wide spectrum allows CRAG to reflect entity popularity ranging from head to tail entities and temporal dynamism ranging from static facts to those that change within seconds.

For context retrieval, CRAG includes mock APIs that simulate real-world web and Knowledge Graph (KG) searches. Each question comes with up to 50 HTML pages retrieved from a real search engine via the Brave Search API, alongside mock KGs containing 2.6 million entities. This dual approach keeps the retrieved content realistic, providing a complex yet authentic environment for testing RAG systems.
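To give a concrete sense of how such mock APIs might be exercised, the sketch below shows a minimal retrieval client and context builder. The function names (`search_web_pages`, `query_kg`, `build_context`) and response shapes are hypothetical stand-ins for illustration, not the benchmark's actual interface.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class RetrievedPage:
    """One of the (up to 50) pre-fetched HTML pages for a question."""
    url: str
    html: str


def search_web_pages(question: str, k: int = 50) -> List[RetrievedPage]:
    """Hypothetical stand-in for the mock web-search API, which serves
    pages originally collected via the Brave Search API."""
    return []  # stub: the real mock API returns pre-crawled pages


def query_kg(entity: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the mock KG API (~2.6M entities),
    returning structured facts about an entity."""
    return {}  # stub: the real mock API returns structured KG facts


def build_context(question: str, entities: List[str], max_pages: int = 5) -> str:
    """Assemble a combined web + KG retrieval context for a RAG prompt."""
    pages = search_web_pages(question)[:max_pages]
    web_part = "\n\n".join(f"[{p.url}]\n{p.html[:2000]}" for p in pages)
    kg_part = "\n".join(f"{e}: {query_kg(e)}" for e in entities)
    return f"Web results:\n{web_part}\n\nKG facts:\n{kg_part}"
```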

Evaluation Metrics and Benchmarks

The evaluation of RAG systems on CRAG is structured around three tasks: Retrieval Summarization, KG and Web Retrieval Augmentation, and End-to-end RAG. The tasks increase in complexity and draw on different retrieval sources, enabling a comprehensive assessment across the main components of a RAG solution.

The evaluation metrics focus on accuracy, hallucination, and missing answers, providing a scoring system that penalizes hallucinated answers while rewarding accurate ones. Both human evaluation (Score_h) and model-based automatic evaluation (Score_a) are employed, with the latter harnessing LLMs such as ChatGPT and Llama 3 for rapid assessment. This dual evaluation strategy ensures robustness and reliability in judging the performance of RAG systems.
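The reported numbers (e.g., answering 63% of questions without hallucination) are consistent with a reward scheme in which accurate answers earn +1, missing answers ("I don't know") earn 0, and hallucinated answers earn -1. The sketch below assumes that scheme; the paper's human evaluation may additionally give partial credit for acceptable answers.

```python
from typing import Iterable

# Per-response labels assigned by a human or an LLM judge.
ACCURATE, MISSING, HALLUCINATED = "accurate", "missing", "hallucinated"

# Assumed reward scheme: accurate +1, missing 0, hallucinated -1.
REWARD = {ACCURATE: 1.0, MISSING: 0.0, HALLUCINATED: -1.0}


def truthfulness_score(labels: Iterable[str]) -> float:
    """Average reward over a set of judged responses.

    Under the assumed +1/0/-1 scheme this equals accuracy minus the
    hallucination rate, so abstaining is strictly better than
    answering wrongly.
    """
    labels = list(labels)
    return sum(REWARD[label] for label in labels) / len(labels)


# Example: 63 accurate, 17 hallucinated, and 20 missing answers out of 100
# yield a score of 0.46.
print(truthfulness_score([ACCURATE] * 63 + [HALLUCINATED] * 17 + [MISSING] * 20))
```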

Baseline Performance and Insights

Initial benchmarking shows that state-of-the-art LLMs achieve at most 34% accuracy when deployed without retrieval augmentation. Adding RAG in a straightforward manner raises accuracy to only 44%, and the accompanying analysis emphasizes the need to mitigate hallucination introduced by irrelevant retrieval results. The benchmark also exposes notable deficits on fast-changing facts and lower-popularity entities, pointing to retrieval filtering and more effective information synthesis as future research directions.
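As a point of reference, a "straightforward" RAG baseline of the kind benchmarked here amounts to little more than stuffing the top retrieved snippets into the prompt and allowing the model to abstain. The prompt wording below is an illustrative assumption, not the authors' exact setup.

```python
from typing import List


def build_rag_prompt(question: str, snippets: List[str], top_k: int = 5) -> str:
    """Concatenate the top-k retrieved snippets and instruct the model to
    abstain when the references are insufficient, trading missing answers
    for fewer hallucinations (which the scoring penalizes more heavily)."""
    context = "\n\n".join(snippets[:top_k])
    return (
        "Answer the question using only the references below. "
        "If the references do not contain the answer, reply 'I don't know'.\n\n"
        f"References:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# The resulting prompt is passed to whichever LLM is under evaluation.
print(build_rag_prompt(
    "Who directed Oppenheimer?",
    ["Oppenheimer (2023) was directed by Christopher Nolan."],
))
```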

State-of-the-art industry RAG systems perform better, with the best reaching 63% perfect answers. Nonetheless, a persistent hallucination rate of around 17% indicates substantial room for improvement. The evaluation shows that RAG systems struggle most with real-time and fast-changing queries, queries requiring complex reasoning, and queries about less popular entities.

Implications and Future Research

CRAG's contribution to RAG research is multifaceted. Practically, it provides a realistic and comprehensive benchmark for developing more trustworthy QA systems. Theoretically, it lays bare the intricacies in handling dynamic and diverse information sources, pushing the envelope in what RAG systems can achieve.

Moving forward, research could delve into more sophisticated ranking mechanisms for retrieval results, advanced synthesis algorithms that better handle noisy data, and the integration of richer contextual understanding in LLMs. Extending CRAG to include multi-lingual and multi-modal questions could further enhance its applicability and challenge spectrum.

By maintaining and expanding CRAG, the research community is well-equipped to drive advancements in RAG solutions, ensuring that they are well-aligned with real-world demands and complexities.

Conclusion

The CRAG benchmark, as elaborated in this paper, represents a significant step toward advancing retrieval-augmented generation systems. Through a comprehensive and diverse dataset, careful evaluation protocols, and insightful initial benchmarks, CRAG addresses existing gaps in RAG research and paves the way for future innovations. The authors' ongoing commitment to maintaining and enhancing CRAG positions it as a pivotal resource for building reliable and effective QA systems.
