- The paper introduces CRUD-RAG, a novel benchmark that evaluates RAG systems using distinct Chinese datasets for Create, Read, Update, and Delete tasks.
- It employs large-scale Chinese news datasets and comprehensive metrics like BLEU, ROUGE, and RAGQuestEval to assess retrieval and generation performance.
- Experiments identify favorable settings, such as larger top-k values for creative tasks, and show GPT-4’s superior performance, guiding advancements in RAG systems.
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of LLMs
Introduction
The paper "CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of LLMs" introduces a novel benchmark designed to evaluate Retrieval-Augmented Generation (RAG) systems in diverse application scenarios. The RAG framework enhances LLMs by integrating external knowledge, addressing limitations such as outdated information and hallucinated content. However, existing benchmarks focus predominantly on question-answering tasks, overlooking other applications, and often assess only the LLM component. This paper addresses these gaps with a benchmark encompassing Create, Read, Update, and Delete (CRUD) categories, reflecting the variety of RAG applications.
Benchmark Overview
CRUD-RAG categorizes RAG tasks into CRUD actions: Create, Read, Update, and Delete, each representing different interactions with knowledge bases:
- Create: Involves generating original content by enriching input text with relevant external information.
- Read: Focuses on extracting and synthesizing knowledge to address complex questions and problems.
- Update: Involves correcting or revising existing information for accuracy and consistency.
- Delete: Concerns summarizing or simplifying information by removing redundancy.
Figure 1: We have classified the application scenarios of RAG into four primary aspects: Create, Read, Update, and Delete. The figure provides an illustrative example for each category, showcasing the wide-ranging potential of RAG technology.
The CRUD-RAG benchmark provides a comprehensive evaluation framework for RAG systems across these scenarios, with dedicated datasets and tasks for each category. These datasets are crucial for assessing the performance of various RAG components, such as retrieval algorithms and LLMs.
Datasets and Evaluation
The paper constructs large-scale datasets covering multiple tasks: text continuation, single and multi-document question answering, hallucination modification, and multi-document summarization. These datasets are generated from high-quality Chinese news articles, ensuring that they challenge the RAG system's ability to integrate external information rather than rely solely on internal knowledge.
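To make the task layout concrete, the sketch below shows one plausible way to register the benchmark's tasks by CRUD category and to shape a single evaluation record. The field names (`category`, `query`, `reference`, `source_docs`) and the exact category-to-task mapping are illustrative assumptions based on the tasks described above, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical registry pairing each CRUD category with the tasks the
# benchmark describes; the structure is illustrative only.
TASKS_BY_CATEGORY = {
    "create": ["text continuation"],
    "read": ["single-document QA", "multi-document QA"],
    "update": ["hallucination modification"],
    "delete": ["multi-document summarization"],
}

@dataclass
class EvalRecord:
    """One illustrative evaluation item: a query scored against a reference."""
    category: str           # one of: create / read / update / delete
    task: str               # e.g. "multi-document QA"
    query: str              # instruction or question given to the RAG system
    reference: str          # ground-truth output used for scoring
    source_docs: list[str] = field(default_factory=list)  # supporting news articles

# Toy example record (content is invented for illustration).
record = EvalRecord(
    category="read",
    task="multi-document QA",
    query="Which two events are described across the retrieved articles?",
    reference="A policy announcement and its market reaction.",
)
print(TASKS_BY_CATEGORY[record.category], record.task)
```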
Figure 2: Illustration of CRUD-RAG, our comprehensive Chinese benchmark for RAG. It classifies the RAG application scenarios into four categories: create, read, update, and delete. For each category, we create appropriate evaluation tasks and datasets. In the experiments, we evaluate various components of the RAG system using our benchmarks.
Evaluation employs both overall semantic-similarity metrics (e.g., BLEU, ROUGE, BERTScore) and a key-information metric (RAGQuestEval) to comprehensively assess the quality of RAG-generated outputs. RAGQuestEval, an adaptation of the QA-based QuestEval metric, measures factual consistency by generating questions from the reference text and checking whether answers drawn from the generated response match the ground-truth answers.
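The snippet below sketches how such an evaluation might be wired together: standard similarity metrics via common libraries (sacrebleu, rouge_score, bert_score, with jieba segmentation for Chinese), plus a QuestEval-style consistency loop in which question generation and question answering are passed in as callables (stubbed here with toy functions). This is a generic illustration under those assumptions, not the paper's released evaluation code.

```python
# pip install sacrebleu rouge-score bert-score jieba
from typing import Callable, Sequence
import jieba
from sacrebleu.metrics import BLEU
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def overall_similarity(prediction: str, reference: str) -> dict:
    """Generic BLEU / ROUGE-L / BERTScore for a Chinese prediction-reference pair."""
    bleu = BLEU(tokenize="zh").corpus_score([prediction], [[reference]]).score
    # rouge_score expects whitespace-tokenized text, so segment Chinese with jieba.
    seg = lambda s: " ".join(jieba.cut(s))
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        seg(reference), seg(prediction))["rougeL"].fmeasure
    _, _, f1 = bert_score([prediction], [reference], lang="zh")
    return {"bleu": bleu, "rougeL": rouge_l, "bertscore_f1": float(f1.mean())}

def questeval_style(
    prediction: str,
    reference: str,
    gen_questions: Callable[[str], Sequence[str]],  # e.g. an LLM-backed question generator
    answer: Callable[[str, str], str],              # returns "" if unanswerable from the context
) -> dict:
    """QuestEval-style consistency check: ask questions derived from the reference
    and compare answers obtained from the reference vs. from the prediction."""
    questions = gen_questions(reference)
    ref_answers = [answer(q, reference) for q in questions]
    pred_answers = [answer(q, prediction) for q in questions]
    recall = sum(a != "" for a in pred_answers) / max(len(questions), 1)

    def token_f1(a: str, b: str) -> float:
        # Crude token-overlap F1 between the two answers.
        ta, tb = set(jieba.cut(a)), set(jieba.cut(b))
        if not ta or not tb:
            return 0.0
        p = len(ta & tb) / len(tb)
        r = len(ta & tb) / len(ta)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    precision = sum(token_f1(ra, pa) for ra, pa in zip(ref_answers, pred_answers)) / max(len(questions), 1)
    return {"recall": recall, "precision": precision}

# Toy usage with stub callables standing in for LLM-backed components.
demo = questeval_style(
    "北京今天下雨。", "北京今日有雨。",
    gen_questions=lambda ref: ["天气如何?"],
    answer=lambda q, ctx: "下雨" if "雨" in ctx else "",
)
```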
Experimental Analysis
The paper presents a meticulous experimental analysis of CRUD-RAG, exploring how each RAG system component (chunk size, chunk overlap, embedding model, retriever, top-k value, and LLM selection) impacts performance; a configuration sketch follows this list:
- Chunk Size and Overlap: Larger chunks preserve document structure that some tasks depend on, while moderate overlap maintains coherence across chunk boundaries.
- Retrievers: Dense retrievers outperform BM25 on tasks requiring semantic understanding; hybrid retrieval with reranking achieves the best results.
- Embedding Models: Performance varies by task, with m3e-base excelling in correction tasks despite ranking lower on retrieval benchmarks.
- Top-k: Larger top-k values increase content diversity, which benefits creative (Create) tasks; other scenarios require balancing retrieval precision against recall.
- LLM Selection: GPT-4 consistently excels, outperforming other models across most metrics, but alternatives like Qwen-14B and Baichuan2-13B offer competitive performance in specific tasks.
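As a rough picture of the parameter sweep this analysis implies, the sketch below enumerates a hypothetical configuration grid over the components listed above. The specific values and option names are placeholders for illustration, not the settings reported in the paper.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RAGConfig:
    """One point in a hypothetical sweep over the RAG components under study."""
    chunk_size: int       # characters per retrieved chunk
    chunk_overlap: int    # characters shared between adjacent chunks
    embedding_model: str  # dense embedding model used for indexing
    retriever: str        # retrieval strategy
    top_k: int            # number of chunks passed to the LLM
    llm: str              # generator model

# Placeholder option grids; the actual swept values are not taken from the paper.
GRID = {
    "chunk_size": [128, 256, 512],
    "chunk_overlap": [0, 32, 64],
    "embedding_model": ["m3e-base", "bge-base-zh"],
    "retriever": ["bm25", "dense", "hybrid+rerank"],
    "top_k": [2, 4, 8],
    "llm": ["gpt-4", "qwen-14b", "baichuan2-13b"],
}

configs = [RAGConfig(*combo) for combo in product(*GRID.values())]
print(f"{len(configs)} candidate configurations, e.g. {configs[0]}")
```

Each configuration would be run against the CRUD datasets and scored with the metrics above, which is what allows the paper to attribute performance differences to individual components rather than to the pipeline as a whole.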
Conclusion
This paper introduces CRUD-RAG, a pioneering benchmark for evaluating RAG systems, emphasizing the integration of external knowledge in LLMs across various scenarios. By incorporating diverse CRUD tasks and extensive datasets, CRUD-RAG provides critical insights for fine-tuning LLMs and RAG components, advancing the development of retrieval-augmented LLMs. The benchmark framework and its findings offer a significant contribution towards understanding and optimizing RAG systems, guiding future research in improving text generation and enhancing the fidelity of information retrieval.