- The paper introduces FACT-Bench, a novel benchmark evaluating LLMs' factual knowledge recall using 20,000 QA pairs across 20 domains and 134 property types.
- The paper finds that pretraining-only models generally outperform their instruction-tuned counterparts, and that larger models outperform smaller ones within the same family, underscoring the benefits of model scaling.
- The paper shows that fine-tuning on knowledge the model already knows improves recall, while counterfactual exemplars in in-context learning degrade performance, highlighting challenges with hallucination.
Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
Introduction
The paper "Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall" introduces a novel benchmark, FACT-Bench, for evaluating the factual knowledge recall capabilities of LLMs. FACT-Bench is designed to address the limitations of previous benchmarks by covering a wide range of domains, property types, and knowledge popularity levels. The benchmark evaluates the performance of 31 models across 10 model families, providing insights into how different factors, such as instruction tuning and model scaling, affect knowledge recall.
FACT-Bench Overview
Dataset Construction
FACT-Bench is constructed as a closed-book question-answering task, where models are required to answer questions without any external context, relying solely on their internal knowledge. The dataset consists of 20,000 QA pairs, balanced across 20 domains and covering 134 property types. Questions are generated using Wikidata triplets and are designed to be simple, valid, diverse, and specific, ensuring that answers are grounded in verifiable knowledge from Wikipedia.
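To illustrate how a Wikidata triplet can be rendered as a closed-book QA pair, the sketch below uses a small template map keyed by property ID. The templates, field names, and example triplet are assumptions for illustration, not the paper's actual generation pipeline.

```python
# A minimal sketch of turning a Wikidata-style triplet into a closed-book QA pair.
# The property-to-template mapping below is an illustrative assumption.

QUESTION_TEMPLATES = {
    "P19": "Where was {subject} born?",         # place of birth
    "P50": "Who is the author of {subject}?",   # author
    "P36": "What is the capital of {subject}?", # capital
}

def triplet_to_qa(subject: str, property_id: str, obj: str) -> dict:
    """Render a (subject, property, object) triplet as a question-answer pair."""
    template = QUESTION_TEMPLATES.get(property_id)
    if template is None:
        raise KeyError(f"No template for property {property_id}")
    return {"question": template.format(subject=subject), "answer": obj}

print(triplet_to_qa("Albert Einstein", "P19", "Ulm"))
# {'question': 'Where was Albert Einstein born?', 'answer': 'Ulm'}
```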
Evaluation Metrics
The benchmark utilizes standard QA metrics, including Exact Match (EM) and F1 score, to measure performance. Additionally, a "Contains" metric is introduced to account for verbose model outputs that still contain the correct answer. Human validation is conducted on a subset of the dataset to ensure high quality and specificity of the questions.
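A minimal sketch of the three metrics, assuming the standard SQuAD-style answer normalization (lowercasing, stripping punctuation and articles); the paper's exact normalization rules may differ.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, answer: str) -> bool:
    """EM: normalized prediction equals normalized gold answer."""
    return normalize(prediction) == normalize(answer)

def f1_score(prediction: str, answer: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(answer).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def contains(prediction: str, answer: str) -> bool:
    """Contains: credit verbose outputs that include the gold answer as a substring."""
    return normalize(answer) in normalize(prediction)
```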
Benchmarking Analysis
The evaluation reveals several important insights:
- Instruction Tuning: Models that are only pretrained generally outperform their instruction-tuned counterparts in terms of factual knowledge recall. This suggests that instruction tuning may align models towards specific task formats at the expense of recall capability.
- Model Scaling: Larger models consistently outperform smaller ones within the same family, indicating the positive effect of model scaling on knowledge recall.
- Gap with Upper-Bound: A significant gap remains between the best-performing model (GPT-4) and human performance, highlighting the ongoing challenge of mastering factual knowledge in LLMs.
Fine-Grained Evaluation
Further analysis reveals that knowledge popularity and property type are strong predictors of recall performance. Models struggle with long-tail entities and certain complex property types. Surprisingly, recall performance is relatively stable across different domains, suggesting that domain-specific knowledge might be less challenging for models than initially thought.
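One way to reproduce this kind of slicing is to bucket recall accuracy by a popularity proxy. The record fields, thresholds, and bucket names in the sketch below are assumptions for illustration; the paper's actual popularity measure may differ.

```python
# A hypothetical sketch of slicing recall accuracy by knowledge popularity.
from collections import defaultdict

def accuracy_by_popularity(results, thresholds=(1_000, 100_000)):
    """results: iterable of dicts with 'popularity' (int) and 'correct' (bool)."""
    low, high = thresholds
    buckets = defaultdict(list)
    for r in results:
        if r["popularity"] < low:
            buckets["long-tail"].append(r["correct"])
        elif r["popularity"] < high:
            buckets["torso"].append(r["correct"])
        else:
            buckets["head"].append(r["correct"])
    # Per-bucket accuracy, skipping empty buckets.
    return {name: sum(v) / len(v) for name, v in buckets.items() if v}
```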
In-Context Learning and Fine-Tuning
Counterfactual ICL Experiments
The study investigates the impact of counterfactual exemplars in in-context learning (ICL) scenarios. It finds that large models suffer significant degradation in recall when exposed to exemplars that contradict known knowledge, emphasizing the importance of exemplar factuality.
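The contrast can be illustrated by building factual and counterfactual few-shot prompts from the same questions; the exemplar content and prompt format below are assumptions, not the paper's exact setup.

```python
# A minimal sketch of assembling factual vs. counterfactual few-shot prompts.

FACTUAL_EXEMPLARS = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen"),
]

COUNTERFACTUAL_EXEMPLARS = [
    ("What is the capital of France?", "Berlin"),        # deliberately wrong
    ("Who wrote 'Pride and Prejudice'?", "Mark Twain"),  # deliberately wrong
]

def build_prompt(exemplars, test_question: str) -> str:
    """Format exemplars as Q/A shots followed by the test question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\nQ: {test_question}\nA:"

factual_prompt = build_prompt(FACTUAL_EXEMPLARS, "What is the capital of Japan?")
counterfactual_prompt = build_prompt(COUNTERFACTUAL_EXEMPLARS, "What is the capital of Japan?")
```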
Fine-Tuning Effects
Fine-tuning experiments demonstrate that training on known knowledge yields better performance than mixed or unknown knowledge. This supports the hypothesis that fine-tuning on knowledge unknown to the model can lead to hallucination, thereby undermining the model's reliability.
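A hypothetical sketch of how such a known/unknown split could be derived, probing the model's zero-shot answers and partitioning the data accordingly; the generate() interface and data format are assumptions for illustration.

```python
# Partition QA pairs into "known" and "unknown" splits before fine-tuning,
# using the model's own zero-shot answers as the probe.

def partition_by_knowledge(qa_pairs, generate, is_correct):
    """qa_pairs: list of {'question', 'answer'} dicts.
    generate: fn(question) -> model answer (e.g., greedy decoding).
    is_correct: fn(prediction, answer) -> bool (e.g., the Contains metric above)."""
    known, unknown = [], []
    for pair in qa_pairs:
        prediction = generate(pair["question"])
        (known if is_correct(prediction, pair["answer"]) else unknown).append(pair)
    return known, unknown

# Fine-tuning on `known` (knowledge the model can already recall) corresponds to
# the condition the paper reports as most beneficial; `unknown` is the contrasting split.
```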
Conclusion
The introduction of FACT-Bench provides a robust framework for evaluating LLMs' factual knowledge recall across a diverse set of conditions. The findings underscore the importance of careful tuning and the selection of training data to enhance the factual accuracy of LLMs. Future research could focus on bridging the gap between current model performance and the human-level upper-bound, possibly by improving model architectures or training methodologies.