Papers
Topics
Authors
Recent
2000 character limit reached

TruthfulQA: Measuring How Models Mimic Human Falsehoods (2109.07958v2)

Published 8 Sep 2021 in cs.CL, cs.AI, cs.CY, and cs.LG

Abstract: We propose a benchmark to measure whether a LLM is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.

Citations (1,362)

Summary

  • The paper introduces a benchmark using 817 questions across 38 categories to evaluate language model truthfulness.
  • It reveals an inverse scaling pattern where larger models like GPT-3 generate increasingly false responses.
  • It suggests that beyond scaling, techniques like prompt engineering and reinforcement learning from human feedback are needed to enhance accuracy.

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Introduction

The paper "TruthfulQA: Measuring How Models Mimic Human Falsehoods" proposes a benchmark intended to evaluate the truthfulness of LLMs when generating answers to questions. With 817 questions covering 38 categories such as health, law, and politics, TruthfulQA assesses whether models avoid conveying widespread human misconceptions. This benchmark is crucial for understanding the extent to which LLMs have internalized falsehoods from their training data, reflecting a phenomenon where larger models, contrary to expectations, tend to be less truthful.

Benchmark and Methodology

TruthfulQA is designed to measure a model's ability to produce truthful answers as opposed to merely generating typical human-like text. The benchmark includes questions that are crafted specifically to trigger falsehoods learned by models during training. Models like GPT-3, GPT-Neo/J, and UnifiedQA were put to test. Surprisingly, the best-performing model, GPT-3 at its largest size, only achieved a truthfulness rate of 58% compared to a human baseline of 94% (Figure 1). Figure 2

Figure 2: TruthfulQA questions exposing common falsehoods in model-generated answers.

Scaling and Truthfulness

A significant finding is the inverse scaling pattern, where larger LLMs result in less truthful answers. This observation is contrary to the scaling laws typically associated with NLP tasks, where performance improves with model size (Figure 3). The larger models are better at text distribution learning, inadvertently increasing their propensity to mimic human misconceptions, known as "imitative falsehoods." The phenomenon underscores the challenge of achieving truthful LLMs through mere parameter scaling. Figure 3

Figure 3: Larger models, including GPT-3 and GPT-Neo/J, perform worse on truthfulness as model size increases.

Evaluating Truthfulness

To evaluate LLM truthfulness, TruthfulQA uses both automated metrics and human evaluations. One of the metrics, GPT-judge, predicts truthfulness with 90-96% accuracy, aligning closely with human evaluation (Figure 4). GPT-judge is finetuned specifically on assessing truthfulness in LLM outputs, offering a reproducible, cost-effective alternative to human evaluation, though human judgment remains the gold standard for nuanced truthfulness assessment. Figure 4

Figure 4: GPT-judge demonstrates high accuracy against human evaluations in predicting truthfulness.

Implications and Future Work

The findings from TruthfulQA suggest that improving the truthfulness of LLMs is unlikely to be achieved through scaling alone. Alternative strategies, such as prompt engineering, finetuning with objectives other than web text imitation, and integrating information retrieval systems, are necessary. Additionally, fine-tuning a model with reinforcement learning from human feedback shows promise for enhancing model truthfulness. Future research must focus on these strategies to align LLM outputs with factual accuracy and public needs.

Conclusion

TruthfulQA provides a crucial measure of LLM truthfulness, highlighting the failures of current state-of-the-art models to consistently produce accurate responses. The work underlies the importance of developing new techniques beyond model scaling to achieve more reliable models, especially essential for applications demanding high truthfulness, such as in legal, medical, and scientific domains. Despite the challenges, TruthfulQA stands as a useful tool for evaluating current and future models, aiding in the pursuit of truthful AI. Figure 5

Figure 5: Comparison of performance improvements on the TruthfulQA benchmark by new model architectures over the original GPT-3 baseline.

Whiteboard

Paper to Video (Beta)

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 3 tweets with 2269 likes about this paper.