RAGAS: Automated Evaluation of Retrieval Augmented Generation

(2309.15217)
Published Sep 26, 2023 in cs.CL

Abstract

We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAs, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

Overview

  • RAGAs introduces a suite of metrics allowing reference-free assessment of Retrieval Augmented Generation systems.

  • The evaluation framework focuses on faithfulness, answer relevance, and context relevance in RAG systems.

  • RAGAs uses a prompt-based approach with a Large Language Model (LLM) to analyze generated answers and their context.

  • Validation was performed using the WikiEval dataset, showing high agreement with human judgments in faithfulness and answer relevance.

  • RAGAs offers a means for developers to efficiently evaluate and iteratively improve RAG systems.

Introduction

The evaluation of Retrieval Augmented Generation (RAG) systems presents numerous challenges, as it requires considering several dimensions at once: the context selected by the retrieval system, the generation module's use of that context, and the quality of the resulting generation. To address these challenges, RAGAs provides a multifaceted suite of metrics for reference-free assessment, enabling comprehensive evaluation of RAG systems without dependence on human annotations.

Evaluation Framework

The RAGAs framework tackles three pivotal axes of RAG evaluation: faithfulness, answer relevance, and context relevance. Faithfulness measures whether the claims in the generated answer can be inferred from the retrieved context, reducing the risk of hallucination. Answer relevance assesses how directly the generated answer addresses the query, independent of whether the answer is factually correct. Context relevance evaluates whether the retrieved context passages contain focused information relevant to the input question, without overloading the generation module with extraneous content.
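As a rough sketch of how these dimensions reduce to scores (our reading of the paper's ratio-based definitions, not the ragas library's actual API), each metric is a simple ratio or similarity average:

```python
from typing import List

def faithfulness(supported_statements: int, total_statements: int) -> float:
    """Fraction of answer statements that can be inferred from the retrieved context."""
    return supported_statements / total_statements if total_statements else 0.0

def answer_relevance(similarities: List[float]) -> float:
    """Mean similarity between the original question and questions generated from the answer."""
    return sum(similarities) / len(similarities) if similarities else 0.0

def context_relevance(relevant_sentences: int, total_sentences: int) -> float:
    """Fraction of context sentences judged necessary to answer the question."""
    return relevant_sentences / total_sentences if total_sentences else 0.0
```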

Methodology

RAGAs performs its analysis through a prompt-based approach using an LLM. To assess faithfulness, the framework decomposes the generated answer into individual statements and verifies whether each can be deduced from the retrieved context. To assess answer relevance, it generates candidate questions from the answer and measures their semantic alignment with the original question. Context relevance is gauged by having the LLM extract the sentences from the retrieved context that are needed to answer the question and computing their proportion relative to the full context.
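The sketch below illustrates this prompt-based procedure for the faithfulness and answer-relevance scores. The `llm` and `embed` callables are hypothetical placeholders for any chat model and embedding model; the exact prompts, and the interfaces in the ragas package itself, differ from this simplified version.

```python
import math
from typing import Callable, List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def faithfulness_score(question: str, answer: str, context: str,
                       llm: Callable[[str], str]) -> float:
    # 1) Decompose the answer into short standalone statements.
    raw = llm(
        f"Break the following answer to '{question}' into short, "
        f"standalone statements, one per line:\n{answer}"
    )
    statements = [s.strip() for s in raw.splitlines() if s.strip()]
    if not statements:
        return 0.0
    # 2) Ask the LLM whether each statement is supported by the context.
    supported = sum(
        llm(
            f"Context:\n{context}\n\nCan this statement be inferred from the "
            f"context above? Answer Yes or No.\nStatement: {s}"
        ).strip().lower().startswith("yes")
        for s in statements
    )
    return supported / len(statements)

def answer_relevance_score(question: str, answer: str,
                           llm: Callable[[str], str],
                           embed: Callable[[str], List[float]],
                           n: int = 3) -> float:
    # Generate candidate questions the answer could be responding to,
    # then compare them to the original question in embedding space.
    generated = [
        llm(f"Write one question that the following answer responds to:\n{answer}")
        for _ in range(n)
    ]
    q_emb = embed(question)
    return sum(cosine(q_emb, embed(g)) for g in generated) / n
```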

Validation and Results

The RAGAs framework was validated on a new dataset, WikiEval, consisting of question-context-answer triples annotated by humans along faithfulness, answer relevance, and context relevance. Benchmarked against these human judgments, RAGAs showed high agreement, particularly for faithfulness and answer relevance. Context relevance proved more challenging, since identifying the key sentences in longer contexts is a task on which ChatGPT occasionally underperformed.

Conclusion

RAGAs offers an automated and efficient means of evaluating RAG systems along the essential quality dimensions of the generated content. Its compatibility with prevalent RAG development frameworks makes it a practical tool for developers, enabling faster iteration and improvement cycles in RAG deployments and, in turn, more reliable LLM-based interfaces to textual knowledge.
