
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization (2005.03754v1)

Published 7 May 2020 in cs.CL

Abstract: Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.

Citations (366)

Summary

  • The paper introduces FEQA, a framework that uses question-answering to evaluate the faithfulness of abstractive summaries.
  • The researchers demonstrate that FEQA correlates strongly with human judgments, outperforming traditional evaluation metrics.
  • The framework offers practical insights for building more reliable summarization systems and potentially extends to other NLP applications.

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

The paper "FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization," authored by Esin Durmus, He He, and Mona Diab, addresses the critical issue of faithfulness in abstractive summarization. The research introduces a framework that evaluates the faithfulness of machine-generated summaries using a question-answering (QA) based approach.

In abstractive summarization, maintaining factual accuracy while generating coherent summaries poses a significant challenge, and traditional evaluation metrics often fail to distinguish faithful from unfaithful content. Addressing this concern, the authors propose FEQA, a framework that relies on QA models to assess whether a summary accurately represents its source document. Question-answer pairs are generated from the summary, a QA model then answers those questions against the source document, and mismatches between the two sets of answers flag unfaithful content, yielding an automated and systematic measure of faithfulness.
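
To make the pipeline concrete, the sketch below shows one way a FEQA-style check could be assembled from off-the-shelf components. It is an illustrative sketch, not the authors' implementation: the question-generation model, the highlight-token input format, and the use of token-level F1 for answer matching are assumptions, and extraction of candidate answer spans from the summary is left to the caller.

```python
# Minimal FEQA-style faithfulness check (illustrative sketch, not the authors' released code).
# Assumptions: an off-the-shelf question-generation model and an extractive QA model from the
# Hugging Face Hub; the model names and the "<hl>" prompt format below are placeholders.
from collections import Counter
from transformers import pipeline

# Hypothetical model choices for question generation and extractive QA.
question_generator = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")
answer_extractor = pipeline("question-answering", model="deepset/roberta-base-squad2")


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answer strings (SQuAD-style matching)."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def feqa_style_score(summary: str, document: str, answer_spans: list[str]) -> float:
    """Average answer-match F1 over questions generated from the summary.

    `answer_spans` are candidate answers (e.g., named entities or noun phrases)
    already extracted from the summary; span extraction is omitted here.
    """
    scores = []
    for span in answer_spans:
        # Highlight the answer span so the QG model asks a question about it
        # (the "<hl>" format is an assumption tied to the illustrative model above).
        highlighted = summary.replace(span, f"<hl> {span} <hl>", 1)
        question = question_generator(f"generate question: {highlighted}")[0]["generated_text"]
        # Answer the generated question against the source document.
        doc_answer = answer_extractor(question=question, context=document)["answer"]
        # A low F1 means the document does not support the summary's answer span.
        scores.append(token_f1(doc_answer, span))
    return sum(scores) / len(scores) if scores else 0.0
```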

The framework's primary advantage is its ability to use QA models as a proxy for examining a summary's alignment with its source. The authors applied this method to benchmark summarization datasets, demonstrating that FEQA correlates well with human judgments of faithfulness. This indicates that the proposed methodology is a reliable proxy for human assessment, offering a scalable way to evaluate summarization models in practice.

Significantly, the experiments suggest that FEQA aligns with human evaluations better than several existing metrics: compared against baselines based on word overlap, embedding similarity, and learned language understanding models, the QA-based metric shows substantially higher correlation with human faithfulness scores, especially on highly abstractive summaries. By illustrating scenarios where traditional metrics fall short, the research underscores the potential of QA-based evaluation as a more nuanced and precise measure of summary faithfulness.
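
As a small illustration of how such metric-human agreement might be quantified, the snippet below computes Pearson and Spearman correlations between metric scores and human faithfulness ratings; the data are hypothetical placeholders, not results from the paper.

```python
# Illustrative check of metric-human agreement (assumed data, not the paper's numbers):
# given per-summary metric scores and human faithfulness ratings, report correlations.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.92, 0.35, 0.78, 0.10, 0.66]  # e.g., FEQA-style scores (hypothetical)
human_ratings = [1.0, 0.4, 0.8, 0.0, 0.6]       # e.g., mean annotator ratings (hypothetical)

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```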

The implications of this research are twofold. Practically, it provides a tool for the development and refinement of more accurate summarization systems, which are crucial for applications involving sensitive or critical information dissemination. Theoretically, it encourages the integration of QA techniques into textual evaluation frameworks, promoting a more holistic assessment of natural language processing tasks beyond mere surface-level metrics.

Future developments may advance this framework by integrating additional contextual or semantic layers into the QA evaluation process, potentially addressing current limitations related to ambiguous or context-dependent summaries. Furthermore, as QA models improve, the effectiveness and reliability of the FEQA framework are expected to improve correspondingly, paving the way for its application to domains beyond summarization, including document-grounded dialogue systems and other content-generation tasks.

In conclusion, the paper presents a detailed exploration of the challenges in evaluating faithfulness in abstractive summarization and offers a solution that advances the field. The further refinement and adoption of frameworks such as FEQA are expected to contribute to more robust, accurate, and reliable natural language processing systems.
