FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (2305.14251v2)
Abstract: Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
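The metric itself is simple to state: decompose a generation into atomic facts, verify each fact against the knowledge source, and report the supported fraction. Below is a minimal sketch in Python of that computation; `extract_atomic_facts` and `is_supported` are hypothetical stand-ins (not the `factscore` package's actual API) for the LM-based decomposition and retrieval-backed verification the abstract describes.

```python
from typing import Callable, List

def factscore(
    generation: str,
    extract_atomic_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Return the fraction of atomic facts in `generation` that the
    knowledge source supports (1.0 = all supported, 0.0 = none)."""
    facts = extract_atomic_facts(generation)  # e.g., LM-based decomposition into short claims
    if not facts:
        return 0.0
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)

# Toy usage with stand-in verifiers; a real estimator would plug in a
# Wikipedia retriever plus an LM judge, as the paper's automated metric does.
demo_facts = {
    "The subject was born in 1960.": True,
    "The subject won a Nobel Prize.": False,
}
score = factscore(
    generation="(biography text)",
    extract_atomic_facts=lambda _text: list(demo_facts),
    is_supported=lambda fact: demo_facts[fact],
)
print(f"FActScore: {score:.2f}")  # -> FActScore: 0.50
```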
- GPT4All: Training an assistant-style chatbot with large scale data distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Re-evaluating evaluation in text summarization. In Proceedings of Empirical Methods in Natural Language Processing.
- Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.
- Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
- Language models are few-shot learners. In Proceedings of Advances in Neural Information Processing Systems.
- Reading Wikipedia to answer open-domain questions. In Proceedings of the Association for Computational Linguistics.
- Generating literal and implied subquestions to fact-check complex claims. In Proceedings of Empirical Methods in Natural Language Processing.
- Seeing things from a different angle: discovering diverse perspectives about claims. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Towards question-answering as an automatic metric for evaluating the content quality of a summary. Transactions of the Association for Computational Linguistics.
- Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.
- Is GPT-3 a good data annotator? arXiv preprint arXiv:2212.10450.
- QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Generating fact checking briefs. In Proceedings of Empirical Methods in Natural Language Processing.
- Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726.
- Enabling large language models to generate text with citations.
- How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
- Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics.
- An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems.
- Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- WiCE: Real-world entailment for claims in Wikipedia. arXiv preprint arXiv:2303.01432.
- Large language models struggle to learn long-tail knowledge. arXiv preprint arXiv:2211.08411.
- LongEval: Guidelines for human evaluation of faithfulness in long-form summarization. In Proceedings of the European Chapter of the Association for Computational Linguistics.
- Evaluating the factual consistency of abstractive text summarization. In Proceedings of Empirical Methods in Natural Language Processing.
- SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics.
- Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems.
- Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine.
- Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.
- Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848.
- GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
- Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. arXiv preprint arXiv:2212.07981.
- Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation.
- ExpertQA: Expert-curated questions and attributed answers. arXiv preprint arXiv:2309.07852.
- When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the Association for Computational Linguistics.
- SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
- SemEval-2019 task 8: Fact checking in community question answering forums. In Proceedings of the 13th International Workshop on Semantic Evaluation.
- Nonparametric masked language modeling. In Findings of the Association for Computational Linguistics: ACL.
- Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In Experimental IR Meets Multilinguality, Multimodality, and Interaction.
- Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Large dual encoders are generalizable retrievers. In Proceedings of Empirical Methods in Natural Language Processing.
- Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
- OpenAI. 2022. ChatGPT blog post. https://openai.com/blog/chatgpt.
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Training language models to follow instructions with human feedback. In Proceedings of Advances in Neural Information Processing Systems.
- Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Conference of the North American Chapter of the Association for Computational Linguistics.
- KILT: a benchmark for knowledge intensive language tasks. In Conference of the North American Chapter of the Association for Computational Linguistics.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.
- Measuring attribution in natural language generation models. arXiv preprint arXiv:2112.12870.
- The role of context in detecting previously fact-checked claims. In Findings of the Association for Computational Linguistics: NAACL 2022.
- Crowdsourcing lightweight pyramids for manual summary evaluation. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021.
- Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF. arXiv preprint arXiv:2309.09055.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- Brian Thompson and Matt Post. 2020. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In Proceedings of Empirical Methods in Natural Language Processing.
- FEVER: a large-scale dataset for fact extraction and VERification. In Conference of the North American Chapter of the Association for Computational Linguistics.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Fact or fiction: Verifying scientific claims. In Proceedings of Empirical Methods in Natural Language Processing.
- SciFact-open: Towards open-domain scientific claim verification. In Findings of the Association for Computational Linguistics: EMNLP.
- Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the Association for Computational Linguistics.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of Empirical Methods in Natural Language Processing.
- Paraphrastic representations at scale. In Proceedings of Empirical Methods in Natural Language Processing: System Demonstrations.
- Generating scientific claims for zero-shot scientific fact checking. In Proceedings of the Association for Computational Linguistics.
- A critical evaluation of evaluations for long-form question answering. In Proceedings of the Association for Computational Linguistics.
- FLASK: Fine-grained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928.
- Automatic evaluation of attribution by large language models. arXiv preprint arXiv:2305.06311.
- Shiyue Zhang and Mohit Bansal. 2021. Finding a balanced degree of automation for summary evaluation. In Proceedings of Empirical Methods in Natural Language Processing.
- BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations.