Evaluating the Factual Consistency of Large Language Models Through News Summarization

Published 15 Nov 2022 in cs.CL | (2211.08412v2)

Abstract: While LLMs have proven to be effective on a large variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e. the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 LLMs ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.


Authors (6)

Citations (83)


Summary

The paper introduces the FIB benchmark, which measures whether an LLM assigns a higher score to a factually consistent news summary than to a factually inconsistent one, and shows that most models do.
It evaluates 23 models from six model families, using model-generated summaries that were manually annotated as factually inconsistent as distractors.
Findings show that larger models are generally more factually consistent, but that LLMs tend to favor inconsistent summaries that appear verbatim in the document, a notable weakness for summarization.

Evaluating the Factual Consistency of LLMs Through News Summarization

The paper "Evaluating the Factual Consistency of LLMs Through News Summarization" examines how reliably LLMs maintain factual consistency in news summarization. This question matters for natural language generation (NLG) because LLMs, while effective on many tasks, are known to hallucinate, that is, to generate information not present in the source material.

To address this, the authors present the Factual Inconsistency Benchmark (FIB), a new benchmark designed to evaluate whether LLMs prefer factually consistent document continuations over inconsistent ones. The benchmark evaluates model performance by comparing the scores that LLMs assign to human-verified factually consistent summaries versus factually inconsistent summaries generated from various summarization models.
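The pairwise comparison described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `pairwise_accuracy`, the example scores, and the choice to count ties as failures are all assumptions made here for clarity.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Return the fraction of documents where the factually consistent
    summary receives a strictly higher score than the inconsistent one.

    Each pair is (score_consistent, score_inconsistent) for one document.
    Counting ties as failures is an assumption of this sketch.
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one score pair")
    wins = sum(1 for consistent, inconsistent in pairs if consistent > inconsistent)
    return wins / len(pairs)


# Hypothetical scores for four documents (not taken from the paper).
example_scores = [(-1.2, -1.9), (-0.8, -0.5), (-2.0, -2.4), (-1.1, -1.6)]
accuracy = pairwise_accuracy(example_scores)
print(f"accuracy = {accuracy:.2f}")
```

In this toy input the consistent summary wins in three of the four documents, so the reported accuracy is 0.75. A model that scored summaries at random would be expected to land near 0.5 on such a binary comparison.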

The authors evaluated 23 LLMs, ranging from 1 billion to 176 billion parameters, from six model families including BLOOM and OPT. The findings indicate that LLMs generally assign higher scores to factually consistent summaries than to factually inconsistent ones. There is one notable exception: when a factually inconsistent summary occurs verbatim in the input document, LLMs tend to assign the higher score to the inconsistent summary.

The factually inconsistent summaries are generated by 22 models and then manually annotated. FIB is built on summaries from the XSum and CNN/DM datasets, which serve as testbeds for more abstractive and more extractive summarization, respectively.

The research offers several key insights:

Factual Consistency Preference: LLMs generally prefer factually consistent summaries. BLOOM, for instance, does so 72.4% of the time.
Verbatim Pitfalls: When the factually inconsistent summary is extracted verbatim from the input document, LLMs rarely prefer the consistent summary. BLOOM does so only 9.6% of the time in these cases.
Scale and Consistency: Factual consistency tends to increase with the number of model parameters.
FactCC Efficacy: Factually inconsistent summaries generated by FactCC can fool some LLMs to a similar extent as model-generated inconsistent summaries.

The paper contributes a benchmark and methodology for assessing the factuality of LLMs and compares models of different sizes and pretraining paradigms. It also opens the door to future work on improving how LLMs handle factual information, and the same methodology could potentially be extended to other domains such as scientific literature and QA systems.

The benchmark and findings highlight the need for LLM outputs that are factually accurate as well as coherent, which matters for the reliability of real-world applications. The work also underlines that the choice of scoring function affects how well an evaluation gauges model behavior, with pointwise mutual information being one scoring method considered.
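To make the pointwise mutual information idea concrete, here is a small sketch of a PMI-style summary score computed from per-token log-probabilities. It relies on the standard identity PMI(summary; document) = log P(summary | document) - log P(summary). The function name `pmi_score`, the optional length normalization, and the toy log-probabilities are assumptions for illustration and are not claimed to match the paper's exact scoring function; in practice the log-probabilities would come from the language model under evaluation.

```python
from typing import Sequence


def pmi_score(
    cond_token_logprobs: Sequence[float],
    uncond_token_logprobs: Sequence[float],
    length_normalize: bool = True,
) -> float:
    """PMI-style score for one summary.

    cond_token_logprobs:   log P(token | document, previous summary tokens)
    uncond_token_logprobs: log P(token | previous summary tokens), no document
    Dividing by the summary length is an assumption of this sketch.
    """
    if len(cond_token_logprobs) != len(uncond_token_logprobs):
        raise ValueError("both sequences must cover the same summary tokens")
    if not cond_token_logprobs:
        raise ValueError("summary must contain at least one token")
    # log P(summary | document) - log P(summary)
    pmi = sum(cond_token_logprobs) - sum(uncond_token_logprobs)
    return pmi / len(cond_token_logprobs) if length_normalize else pmi


# Toy per-token log-probabilities for a three-token summary (hypothetical).
cond = [-0.5, -1.0, -0.5]
uncond = [-2.0, -1.5, -2.5]
raw_pmi = pmi_score(cond, uncond, length_normalize=False)
normalized_pmi = pmi_score(cond, uncond)
print(raw_pmi, normalized_pmi)
```

Subtracting the unconditional term discounts summaries that are likely regardless of the document, so a generic but fluent summary does not win on fluency alone. The per-document scores produced this way would then feed the pairwise comparison between the consistent and inconsistent summary.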
