
Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall (2404.16164v1)

Published 24 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

Summary

  • The paper introduces FACT-Bench, a novel benchmark evaluating LLMs' factual knowledge recall using 20,000 QA pairs across 20 domains and 134 property types.
  • The paper finds that pretraining-only models consistently outperform their instruction-tuned counterparts, and that larger models outperform smaller ones within each family, underscoring the benefits of model scaling.
  • The paper reveals that fine-tuning on known knowledge enhances recall, while counterfactual exemplars in ICL degrade performance, highlighting challenges with hallucination.

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Introduction

The paper "Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall" introduces a novel benchmark, FACT-Bench, for evaluating the factual knowledge recall capabilities of LLMs. FACT-Bench is designed to address the limitations of previous benchmarks by covering a wide range of domains, property types, and knowledge popularity levels. The benchmark evaluates the performance of 31 models across 10 model families, providing insights into how different factors, such as instruction tuning and model scaling, affect knowledge recall.

FACT-Bench Overview

Dataset Construction

FACT-Bench is constructed as a closed-book question-answering task, where models are required to answer questions without any external context, relying solely on their internal knowledge. The dataset consists of 20,000 QA pairs, balanced across 20 domains and covering 134 property types. Questions are generated using Wikidata triplets and are designed to be simple, valid, diverse, and specific, ensuring that answers are grounded in verifiable knowledge from Wikipedia.
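
To make the construction concrete, the sketch below shows one way a Wikidata-style triplet could be turned into a closed-book QA pair. The `triplet_to_qa` helper, the template strings, and the domain label are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' pipeline): turning a Wikidata-style
# (subject, property, object) triplet into a closed-book QA pair via a
# handful of hypothetical per-property question templates.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAPair:
    question: str
    answer: str
    domain: str
    property_type: str

# Hypothetical question templates keyed by Wikidata property ID.
TEMPLATES = {
    "P36": "What is the capital of {subject}?",   # capital
    "P19": "Where was {subject} born?",           # place of birth
    "P50": "Who is the author of {subject}?",     # author
}

def triplet_to_qa(subject: str, prop: str, obj: str, domain: str) -> Optional[QAPair]:
    """Map a triplet to a QA pair if a template exists for its property."""
    template = TEMPLATES.get(prop)
    if template is None:
        return None
    return QAPair(question=template.format(subject=subject),
                  answer=obj, domain=domain, property_type=prop)

print(triplet_to_qa("France", "P36", "Paris", "Geography"))
```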

Evaluation Metrics

The benchmark utilizes standard QA metrics, including Exact Match (EM) and F1 score, to measure performance. Additionally, a "Contains" metric is introduced to account for verbose model outputs that still contain the correct answer. Human validation is conducted on a subset of the dataset to ensure high quality and specificity of the questions.
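
The three metrics follow standard closed-book QA conventions; a minimal sketch is given below, assuming SQuAD-style answer normalization (the exact normalization used in FACT-Bench may differ).

```python
# Minimal sketch of the three QA metrics described above (standard SQuAD-style
# definitions; FACT-Bench's exact normalization may differ).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def contains(pred: str, gold: str) -> float:
    """Credit verbose outputs that still include the gold answer."""
    return float(normalize(gold) in normalize(pred))
```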

Benchmarking Analysis

Model Performance

The evaluation reveals several important insights:

  • Instruction Tuning: Models that are only pretrained generally outperform their instruction-tuned counterparts in terms of factual knowledge recall. This suggests that instruction tuning may align models towards specific task formats at the expense of recall capability.
  • Model Scaling: Larger models consistently outperform smaller ones within the same family, indicating the positive effect of model scaling on knowledge recall.
  • Gap with Upper-Bound: Even the best-performing model, GPT-4, still falls well short of the benchmark's upper bound, highlighting the ongoing challenge of reliable factual knowledge recall in LLMs.

Fine-Grained Evaluation

Further analysis reveals that knowledge popularity and property type are strong predictors of recall performance. Models struggle with long-tail entities and certain complex property types. Surprisingly, recall performance is relatively stable across different domains, suggesting that domain-specific knowledge might be less challenging for models than initially thought.
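
A breakdown of this kind can be computed by grouping per-question scores by popularity level and property type. The snippet below is a toy illustration with made-up column names and data, not the paper's analysis code.

```python
# Toy sketch of the fine-grained breakdown: group per-question exact-match
# scores by popularity bucket and property type (column names are illustrative).
import pandas as pd

results = pd.DataFrame({
    "property_type": ["P36", "P36", "P19", "P19"],
    "popularity":    ["head", "tail", "head", "tail"],
    "em":            [1, 0, 1, 0],
})
print(results.groupby(["popularity", "property_type"])["em"].mean())
```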

In-Context Learning and Fine-Tuning

Counterfactual ICL Experiments

The study investigates the impact of counterfactual exemplars in in-context learning (ICL) scenarios. It finds that large models suffer significant degradation in recall when exposed to exemplars that contradict the model's known knowledge, and that the degradation grows with the number of such exemplars, emphasizing the importance of exemplar factuality.
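
The sketch below illustrates one way such a counterfactual prompt could be assembled, replacing the gold answer in a subset of exemplars with a contradicting one. The function and the toy exemplars are assumptions, not the paper's exact setup.

```python
# Illustrative sketch: build a few-shot prompt in which some exemplars carry
# counterfactual answers, i.e. answers that contradict the gold fact.
import random

def build_prompt(exemplars, test_question, n_counterfactual=2, seed=0):
    """exemplars: list of (question, gold_answer, wrong_answer) tuples."""
    rng = random.Random(seed)
    flipped = set(rng.sample(range(len(exemplars)), n_counterfactual))
    lines = []
    for i, (q, gold, wrong) in enumerate(exemplars):
        answer = wrong if i in flipped else gold  # inject contradiction
        lines.append(f"Q: {q}\nA: {answer}")
    lines.append(f"Q: {test_question}\nA:")
    return "\n\n".join(lines)

exemplars = [
    ("What is the capital of France?", "Paris", "Lyon"),
    ("Who wrote Hamlet?", "William Shakespeare", "Charles Dickens"),
    ("What is the capital of Japan?", "Tokyo", "Osaka"),
]
print(build_prompt(exemplars, "Where was Marie Curie born?"))
```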

Fine-Tuning Effects

Fine-tuning experiments on LLaMA-7B demonstrate that training on a model's known knowledge yields better performance than training on unknown or mixed knowledge. This supports the hypothesis that fine-tuning on knowledge unknown to the model can encourage hallucination, thereby undermining the model's reliability.
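
Under a simple reading of this setup, "known" facts are those the base model already answers correctly in a closed-book probe. The sketch below partitions fine-tuning data accordingly; `query_model` stands in for the actual inference call, and the correctness check is a `contains`-style match.

```python
# Sketch (assumptions, not the paper's exact procedure): split fine-tuning data
# into facts the base model already recalls ("known") and facts it does not.
def split_known_unknown(qa_pairs, query_model, is_correct):
    """Partition QA pairs by whether the base model answers them correctly.

    qa_pairs:    iterable of (question, gold_answer)
    query_model: callable question -> prediction (placeholder for LLaMA-7B inference)
    is_correct:  metric such as `contains` above
    """
    known, unknown = [], []
    for question, answer in qa_pairs:
        prediction = query_model(question)
        (known if is_correct(prediction, answer) else unknown).append((question, answer))
    return known, unknown

# Toy usage with a dummy model; real use would query the base LLaMA-7B checkpoint.
dummy_model = lambda q: "Paris" if "France" in q else "unknown"
known, unknown = split_known_unknown(
    [("What is the capital of France?", "Paris"),
     ("Where was Marie Curie born?", "Warsaw")],
    dummy_model,
    lambda pred, gold: gold.lower() in pred.lower(),
)
```

Fine-tuning would then be run separately on the known, unknown, and mixed subsets, with the paper reporting the best recall when training on known knowledge.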

Conclusion

The introduction of FACT-Bench provides a robust framework for evaluating LLMs' factual knowledge recall across a diverse set of conditions. The findings underscore the importance of careful tuning and the selection of training data to enhance the factual accuracy of LLMs. Future research could focus on bridging the gap between current model performance and the benchmark's upper bound, possibly by improving model architectures or training methodologies.
