- The paper introduces a benchmark using 817 questions across 38 categories to evaluate language model truthfulness.
- It reveals an inverse scaling pattern where larger models like GPT-3 generate increasingly false responses.
- It suggests that beyond scaling, techniques like prompt engineering and reinforcement learning from human feedback are needed to enhance accuracy.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Introduction
The paper "TruthfulQA: Measuring How Models Mimic Human Falsehoods" proposes a benchmark intended to evaluate the truthfulness of LLMs when generating answers to questions. With 817 questions covering 38 categories such as health, law, and politics, TruthfulQA assesses whether models avoid conveying widespread human misconceptions. This benchmark is crucial for understanding the extent to which LLMs have internalized falsehoods from their training data, reflecting a phenomenon where larger models, contrary to expectations, tend to be less truthful.
Benchmark and Methodology
TruthfulQA is designed to measure a model's ability to produce truthful answers as opposed to merely generating typical human-like text. The benchmark includes questions that are crafted specifically to trigger falsehoods learned by models during training. Models like GPT-3, GPT-Neo/J, and UnifiedQA were put to test. Surprisingly, the best-performing model, GPT-3 at its largest size, only achieved a truthfulness rate of 58% compared to a human baseline of 94% (Figure 1).
Figure 2: TruthfulQA questions exposing common falsehoods in model-generated answers.
Scaling and Truthfulness
A significant finding is the inverse scaling pattern, where larger LLMs result in less truthful answers. This observation is contrary to the scaling laws typically associated with NLP tasks, where performance improves with model size (Figure 3). The larger models are better at text distribution learning, inadvertently increasing their propensity to mimic human misconceptions, known as "imitative falsehoods." The phenomenon underscores the challenge of achieving truthful LLMs through mere parameter scaling.
Figure 3: Larger models, including GPT-3 and GPT-Neo/J, perform worse on truthfulness as model size increases.
Evaluating Truthfulness
To evaluate LLM truthfulness, TruthfulQA uses both automated metrics and human evaluations. One of the metrics, GPT-judge, predicts truthfulness with 90-96% accuracy, aligning closely with human evaluation (Figure 4). GPT-judge is finetuned specifically on assessing truthfulness in LLM outputs, offering a reproducible, cost-effective alternative to human evaluation, though human judgment remains the gold standard for nuanced truthfulness assessment.
Figure 4: GPT-judge demonstrates high accuracy against human evaluations in predicting truthfulness.
Implications and Future Work
The findings from TruthfulQA suggest that improving the truthfulness of LLMs is unlikely to be achieved through scaling alone. Alternative strategies, such as prompt engineering, finetuning with objectives other than web text imitation, and integrating information retrieval systems, are necessary. Additionally, fine-tuning a model with reinforcement learning from human feedback shows promise for enhancing model truthfulness. Future research must focus on these strategies to align LLM outputs with factual accuracy and public needs.
Conclusion
TruthfulQA provides a crucial measure of LLM truthfulness, highlighting the failures of current state-of-the-art models to consistently produce accurate responses. The work underlies the importance of developing new techniques beyond model scaling to achieve more reliable models, especially essential for applications demanding high truthfulness, such as in legal, medical, and scientific domains. Despite the challenges, TruthfulQA stands as a useful tool for evaluating current and future models, aiding in the pursuit of truthful AI.
Figure 5: Comparison of performance improvements on the TruthfulQA benchmark by new model architectures over the original GPT-3 baseline.