Generating Benchmarks for Factuality Evaluation of Language Models

Published 13 Jul 2023 in cs.CL and cs.AI | (2307.06908v2)

Abstract: Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing methods for factuality evaluation of LLM generation focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent domain specific or rare facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create three benchmarks: Wiki-FACTOR, News-FACTOR and Expert-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available at https://github.com/AI21Labs/factor.

Authors (10)

Citations (66)

Summary

The paper presents FACTOR, an automated framework that transforms corpora into controlled factuality benchmarks by injecting varied errors.
It evaluates language model performance across encyclopedic, news, and expert domains, showing that retrieval augmentation improves factual accuracy.
The study reveals a misalignment between perplexity and factuality scores, highlighting FACTOR’s role as a complementary metric in model evaluation.

An Analysis of FACTOR: A Factuality Evaluation Approach for LLMs

Introduction

The spread of LLMs across text applications calls for reliable ways to evaluate their factual accuracy, particularly before deployment in critical domains. The paper "Generating Benchmarks for Factuality Evaluation of Language Models" proposes FACTOR (Factual Assessment via Corpus TransfORmation) to address this need. FACTOR measures an LLM's propensity to generate factual statements by transforming a chosen corpus into a factuality benchmark. It targets a limitation of existing approaches, which evaluate facts sampled from the model itself and therefore cannot control which facts are assessed, often under-representing rare or domain-specific content.

Approach

FACTOR builds a benchmark through an automated process that transforms a corpus into multiple-choice examples by injecting controlled errors into factual statements. Each factual statement from the corpus is paired with several non-factual alternatives produced by an automatic transformation pipeline. The alternatives cover several error types (entity, predicate, circumstance, coreference, and link), so the benchmark tests more than one kind of factual mistake.
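The multiple-choice setup described above can be sketched as a small data structure plus an accuracy function. This is a minimal illustration, not the authors' implementation: the names `FactorItem`, `is_correct`, and `factor_accuracy` are hypothetical, and the rule "the model is correct when it scores the factual completion strictly above every alternative" is an assumption. The scoring function itself (for example, a length-normalized log-likelihood from the LM) is left as a parameter.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FactorItem:
    """One multiple-choice example (field names are hypothetical)."""
    prefix: str                 # context that precedes the statement
    factual: str                # original completion taken from the corpus
    non_factual: Sequence[str]  # automatically generated incorrect variants


# A scorer maps (prefix, completion) to a number; higher means "more likely".
Scorer = Callable[[str, str], float]


def is_correct(item: FactorItem, score: Scorer) -> bool:
    """True if the factual completion scores strictly above every alternative."""
    true_score = score(item.prefix, item.factual)
    return all(true_score > score(item.prefix, alt) for alt in item.non_factual)


def factor_accuracy(items: Sequence[FactorItem], score: Scorer) -> float:
    """Fraction of items on which the factual completion wins."""
    if not items:
        raise ValueError("need at least one item")
    return sum(is_correct(item, score) for item in items) / len(items)


def toy_score(prefix: str, completion: str) -> float:
    """Stand-in for an LM score (assumption: a real scorer would query a model)."""
    return 1.0 if "Paris" in completion else 0.0


example = FactorItem(
    prefix="The Eiffel Tower is located in",
    factual=" Paris.",
    non_factual=(" Rome.", " Berlin."),
)
print(factor_accuracy([example], toy_score))
```

Keeping the scorer as a plain callable means the same accuracy code can be reused with any model, including a retrieval-augmented one, by swapping in a different scoring function.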

Results and Evaluation

The FACTOR framework was used to generate three benchmarks, Wiki-FACTOR, News-FACTOR, and Expert-FACTOR, covering encyclopedic text, news, and expert-domain question-answer data respectively. Evaluating a range of LLMs on them revealed the following trends:

Model Performance and Size: As anticipated, performance on the FACTOR benchmarks generally improved with model size. Even the largest models, however, left substantial room for improvement, which suggests the benchmarks are demanding.
Retrieval Augmentation: Models augmented with retrieval via IC-RALM (In-Context Retrieval-Augmented Language Models) showed improved factual accuracy. This points to the value of retrieval for grounding an LLM's output in factual content.
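As a rough illustration of in-context retrieval augmentation, the sketch below prepends a retrieved passage to the prefix before it is scored, leaving the LM itself unchanged. The word-overlap retriever and the function names `retrieve` and `augment_prefix` are assumptions made for this example; the retriever used in the paper is not described in this summary.

```python
from typing import Sequence


def _tokens(text: str) -> set[str]:
    """Lowercased whitespace-separated words (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def retrieve(query: str, passages: Sequence[str]) -> str:
    """Return the passage sharing the most words with the query (toy retriever)."""
    if not passages:
        raise ValueError("need at least one passage")
    query_tokens = _tokens(query)
    return max(passages, key=lambda p: len(query_tokens & _tokens(p)))


def augment_prefix(prefix: str, passages: Sequence[str]) -> str:
    """Prepend the retrieved passage to the prefix; the LM is not modified."""
    return retrieve(prefix, passages) + "\n\n" + prefix


corpus = [
    "The Eiffel Tower stands in Paris.",
    "The Colosseum is an amphitheatre in Rome.",
]
print(augment_prefix("The Eiffel Tower is located in", corpus))
```

The augmented prefix can then be passed to whatever scoring function is used for the benchmark, so the comparison with and without retrieval differs only in the input text.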

FACTOR vs. Perplexity

A key finding of the paper is that perplexity and FACTOR scores do not always rank LLMs in the same order. Perplexity is a commonly used proxy for LLM capability, yet models with better perplexity did not always achieve higher FACTOR accuracy. The authors further report that when the two metrics disagree, the FACTOR score better reflects factuality in open-ended generation, as judged by human annotators. This supports using FACTOR as a complementary metric rather than relying on perplexity alone.
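To make the ranking disagreement concrete, the sketch below computes perplexity from per-token log-probabilities and lists model pairs where the model with better (lower) perplexity has worse (lower) benchmark accuracy. The function names, the model names `model_a` and `model_b`, and all numbers are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import math
from typing import Mapping, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def discordant_pairs(
    ppl: Mapping[str, float], acc: Mapping[str, float]
) -> list[tuple[str, str]]:
    """Pairs (a, b) where a has lower perplexity than b but also lower accuracy."""
    names = sorted(ppl)
    return [
        (a, b)
        for a in names
        for b in names
        if a != b and ppl[a] < ppl[b] and acc[a] < acc[b]
    ]


# Hypothetical scores, invented for this example only.
ppl_scores = {"model_a": 10.0, "model_b": 12.0}
acc_scores = {"model_a": 0.55, "model_b": 0.60}
print(discordant_pairs(ppl_scores, acc_scores))
```

An empty result means the two metrics rank every pair of models the same way; any returned pair is a case where choosing a model by perplexity alone would pick the one with lower benchmark accuracy.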

Practical and Theoretical Implications

The introduction of FACTOR benchmarks bears significant implications for both model evaluation and development strategies in natural language processing:

Evaluation Benchmarks: FACTOR benchmarks provide a robust framework that can be extended and adapted to various domains requiring fact-grounded textual generation.
Model Training Insights: The creation and use of non-factual variants as part of the FACTOR evaluation highlights the potential of introducing similar datasets in model training phases to bolster factual accuracy.

Future Developments

Future research can explore integrating FACTOR-style data into training, which may make models more sensitive to factual correctness. Since retrieval improved FACTOR scores, refining retrieval components and evaluating them directly against factuality benchmarks could yield further gains in fact-aware model outputs.

This study adds to the methodology for assessing the factual accuracy of LLMs, offering a tool for understanding and improving how they perform in fact-sensitive deployments. As natural language processing evolves, FACTOR can help shape ethical and accurate text generation practices across application domains.
