
AlignScore: Evaluating Factual Consistency with a Unified Alignment Function (2305.16739v1)

Published 26 May 2023 in cs.CL

Abstract: Many text generation applications require the generated text to be factually consistent with input information. Automatic evaluation of factual consistency is challenging. Previous work has developed various metrics that often depend on specific functions, such as natural language inference (NLI) or question answering (QA), trained on limited data. Those metrics thus can hardly assess diverse factual inconsistencies (e.g., contradictions, hallucinations) that occur in varying inputs/outputs (e.g., sentences, documents) from different tasks. In this paper, we propose AlignScore, a new holistic metric that applies to a variety of factual inconsistency scenarios as above. AlignScore is based on a general function of information alignment between two arbitrary text pieces. Crucially, we develop a unified training framework of the alignment function by integrating a large diversity of data sources, resulting in 4.7M training examples from 7 well-established tasks (NLI, QA, paraphrasing, fact verification, information retrieval, semantic similarity, and summarization). We conduct extensive experiments on large-scale benchmarks including 22 evaluation datasets, where 19 of the datasets were never seen in the alignment training. AlignScore achieves substantial improvement over a wide range of previous metrics. Moreover, AlignScore (355M parameters) matches or even outperforms metrics based on ChatGPT and GPT-4 that are orders of magnitude larger.

Citations (139)

Summary

  • The paper introduces AlignScore, a novel metric relying on a unified alignment function to assess factual consistency in text generation.
  • It leverages 4.7 million examples from 15 datasets, significantly enhancing generalizability over standard NLI and QA-based metrics.
  • Experimental results demonstrate that AlignScore can match or outperform larger models like GPT-4 using only 355M parameters.

Evaluating Factual Consistency with a Unified Alignment Function

The paper presents a comprehensive approach to evaluating factual consistency in text generation tasks, an essential quality for applications such as summarization and dialogue systems. The central contribution is a novel metric, AlignScore, based on a unified information alignment function. This metric addresses a common failure mode of natural language generation systems, whose output often contains factual inconsistencies, such as contradictions or hallucinations, relative to the input context.
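To make the idea concrete, the following is a minimal sketch of how an alignment-based consistency score can be computed. The align_prob function below is a toy lexical-overlap stand-in for the trained 355M-parameter alignment model, and the chunking plus max-over-chunks / mean-over-sentences aggregation is an illustrative assumption rather than the paper's exact procedure.

import string

def _tokens(text: str) -> list[str]:
    # Lowercase and strip punctuation; a crude tokenizer for illustration.
    return text.lower().translate(
        str.maketrans("", "", string.punctuation)).split()

def align_prob(context: str, claim: str) -> float:
    """Toy stand-in for a learned alignment model: the fraction of claim
    tokens that also appear in the context (illustration only)."""
    ctx_tokens = set(_tokens(context))
    claim_tokens = _tokens(claim)
    return sum(t in ctx_tokens for t in claim_tokens) / max(len(claim_tokens), 1)

def split_sentences(text: str) -> list[str]:
    # Naive period-based sentence splitter, for illustration only.
    return [s.strip() for s in text.split(".") if s.strip()]

def chunk(text: str, max_words: int = 300) -> list[str]:
    # Break a long context into roughly fixed-size chunks.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)] or [text]

def align_score(context: str, candidate: str) -> float:
    """Score each candidate sentence against every context chunk, keep the
    best-supported chunk per sentence, then average over sentences."""
    chunks = chunk(context)
    sentences = split_sentences(candidate) or [candidate]
    per_sentence = [max(align_prob(c, s) for c in chunks) for s in sentences]
    return sum(per_sentence) / len(per_sentence)

context = "The meeting was moved to Friday. Alice will present the results."
print(align_score(context, "Alice will present on Friday."))  # higher score
print(align_score(context, "Bob cancelled the meeting."))     # lower score

In the real metric, align_prob would be replaced by the trained alignment model's support probability; the surrounding aggregation logic is what lets a sentence-level model score document-level inputs.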

Overview

The authors critique existing metrics for factual consistency, which often rely on task-specific functions, such as Natural Language Inference (NLI) or Question Answering (QA), trained on narrow data. Because these conventional metrics are trained on limited datasets, they lack the generalizability needed to evaluate a wide spectrum of factual inconsistencies across diverse text types and domains.

To address these limitations, the authors propose AlignScore, a metric that leverages a generalized alignment function to evaluate the factual consistency between two arbitrary text pieces. This alignment function is trained on an extensive variety of data sources, aggregating 4.7 million training examples from 15 datasets across seven established language tasks: NLI, QA, paraphrasing, fact verification, information retrieval, semantic similarity, and summarization. By employing such a diverse and extensive training set, the alignment function acquires a broad notion of factual consistency, enabling it to generalize to a wide array of evaluation scenarios.
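To give a concrete sense of what this unified training framework might look like, the hedged sketch below converts examples from a few of the seven tasks into a shared (text_a, text_b, label) alignment format. The three-way label set and the specific mappings are assumptions chosen for illustration, not the paper's published conversion rules.

# Illustrative conversion of heterogeneous supervision into one alignment
# format: (text_a, text_b, label). Label names and mappings below are
# assumptions for this sketch.

ALIGNED, NEUTRAL, CONTRADICT = "aligned", "neutral", "contradict"

def from_nli(premise: str, hypothesis: str, nli_label: str) -> tuple[str, str, str]:
    # NLI labels map naturally onto three-way alignment labels.
    mapping = {"entailment": ALIGNED,
               "neutral": NEUTRAL,
               "contradiction": CONTRADICT}
    return premise, hypothesis, mapping[nli_label]

def from_paraphrase(sent_a: str, sent_b: str, is_paraphrase: bool) -> tuple[str, str, str]:
    # Binary tasks collapse to aligned vs. not aligned; non-paraphrases
    # are treated as neutral here (an assumption).
    return sent_a, sent_b, ALIGNED if is_paraphrase else NEUTRAL

def from_fact_verification(evidence: str, claim: str, verdict: str) -> tuple[str, str, str]:
    # FEVER-style verdicts: SUPPORTS / REFUTES / NOT ENOUGH INFO.
    mapping = {"SUPPORTS": ALIGNED,
               "REFUTES": CONTRADICT,
               "NOT ENOUGH INFO": NEUTRAL}
    return evidence, claim, mapping[verdict]

training_examples = [
    from_nli("A man plays a guitar.", "A person makes music.", "entailment"),
    from_paraphrase("The cat sat on the mat.", "A cat was sitting on a mat.", True),
    from_fact_verification("Paris is the capital of France.",
                           "France's capital is Paris.", "SUPPORTS"),
]

A single model trained on such pooled examples can then serve as the alignment function used at evaluation time.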

Experimental Results

The authors validate their approach with extensive experiments on large-scale benchmarks spanning 22 evaluation datasets, 19 of which were never seen during alignment training. AlignScore demonstrates substantial improvements over a wide range of previous metrics. Particularly noteworthy is its ability to match or even outperform metrics based on far larger models such as ChatGPT and GPT-4 while using only 355M parameters, reflecting an efficient use of computational resources.

Implications

The implications of this work are significant for the field of AI, particularly in the development and evaluation of natural language generation systems. From a theoretical standpoint, this research supports the notion that a holistic approach to training evaluation metrics on diverse data can significantly enhance their adaptability and accuracy. Practically, this could lead to more reliable and consistent outputs in real-world applications, enhancing trust and user experience in AI-driven systems.

Future Directions

While the paper presents a promising step forward in factual consistency evaluation, several future directions merit consideration. These include expanding language coverage, as the current work focuses primarily on English. Exploring more interpretable modeling approaches could also clarify how and why certain outputs are judged factually consistent, aiding the transparency and ethical development of AI systems.

Overall, the paper makes a significant contribution to the field by providing a robust, scalable, and more generalized approach to assessing factual consistency in text generation, offering a valuable tool for ongoing and future AI research and applications.
