TruthfulQA: Measuring How Models Mimic Human Falsehoods (2109.07958v2)

Published 8 Sep 2021 in cs.CL, cs.AI, cs.CY, and cs.LG

Abstract: We propose a benchmark to measure whether a LLM is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.

Citations (1,362)

View on Semantic Scholar

Summary

The paper introduces the TruthfulQA benchmark with 817 questions across 38 categories to assess language models' propensity for imitative falsehoods.
The paper finds that larger models exhibit inverse scaling, with truthfulness scores dropping from 33% to 21% compared to a 94% human baseline.
The paper employs rigorous human evaluations and a fine-tuned 'GPT-judge' system to reliably validate the truthfulness metrics of various LM architectures.

TruthfulQA: Evaluating the Implicit Falsehoods in LLMs

Recent advancements in LLMs (LMs) have spotlighted their ability to generate fluent text across various applications. However, a less explored facet of these models is their tendency to generate falsehoods, often mirroring human misconceptions. The paper "TruthfulQA: Measuring How Models Mimic Human Falsehoods" by Lin, Hilton, and Evans addresses this issue by presenting a benchmark explicitly designed to evaluate the truthfulness of LMs on a diverse set of questions.

The benchmark, termed TruthfulQA, comprises 817 questions spanning 38 categories, including domains such as health, law, finance, and politics. The questions are crafted to provoke false answers from models, imitating widely-held human misconceptions. This design is intentional, aiming to quantify a model's propensity to generate what the authors call "imitative falsehoods."

Key Findings

Performance of Current Models: Four prominent model architectures—GPT-3, GPT-Neo/J, GPT-2, and UnifiedQA—were evaluated using the TruthfulQA benchmark. The best-performing model achieved a truthfulness score of 58%, a stark contrast to the 94% truthfulness observed in human responses. Notably, larger models within the same architecture tended to perform worse, an inverse scaling phenomenon that contrasts with other NLP tasks where performance generally improves with model size.
Inverse Scaling: The degradation in truthfulness with larger model sizes indicates that these models are better at learning and mimicking the training distribution, which includes prevalent human falsehoods. For instance, GPT-3, when scaling from smaller to larger models, showed a decrease in truthfulness from 33% to 21%.
Evaluation Methodology: The authors employed a rigorous human evaluation methodology, ensuring consistency and objectivity by using qualitative labels to score responses. This approach allowed for nuanced assessments of model outputs, differentiating between outright falsehoods, partial truths, and instances where models appropriately expressed uncertainty.
Automated Evaluation: To complement human evaluations, the authors developed 'GPT-judge', a fine-tuned GPT-3-6.7B model tasked with predicting the truthfulness of generated answers. GPT-judge achieved high accuracy, correlating well with human evaluations and providing a scalable way to measure truthfulness across extensive model outputs.
Comparison with Newer Models: Post-benchmark publication, newer models such as Anthropic, InstructGPT, and WebGPT (leveraging information retrieval) have been tested, showing improved performance. Nevertheless, there remains a significant gap between the best-performing model and human baseline.

Implications

The findings from this paper have broad implications for the future development and deployment of LLMs:

Potential for Misuse: The generation of plausible but false information by larger models underscores the risks of deploying LMs in critical applications like healthcare and legal advice. Without robust mechanisms to ensure truthfulness, these models could propagate misinformation at scale.
Strategies for Improvement: The results suggest that merely scaling up models is insufficient to enhance truthfulness. Alternative approaches, such as fine-tuning with human feedback, prompt engineering, and integrating information retrieval systems, appear more promising. For example, models like InstructGPT show that aligning model outputs with human preferences can significantly improve truthful generation.
Benchmark Utility: The TruthfulQA benchmark serves as a valuable tool for evaluating and guiding improvements in LLM performance. It highlights the necessity for benchmarks that go beyond typical NLP tasks to address more nuanced and domain-specific criteria like truthfulness.

Future Directions

The paper opens several avenues for future research:

Exploration of Fine-Tuning Techniques: Further experimentation with different fine-tuning approaches, including reinforcement learning from human feedback (RLHF) and adversarial training, could yield improvements in model truthfulness.
Expansion of Question Domains: Increasing the diversity and depth of question categories within TruthfulQA could provide more comprehensive assessments and uncover additional patterns of imitative falsehoods.
Increased Interpretability: Developing techniques to better understand why models generate false statements and how these can be systematically mitigated will be crucial for building more reliable AI systems.

In summary, the TruthfulQA benchmark represents a significant step towards understanding and mitigating the inherent falsehoods in current LLMs. The paper sheds light on the need for more sophisticated training and evaluation methods to enhance the reliability and utility of these models in real-world applications.

PDF Markdown

Related Papers

Tweets

https://twitter.com/anmorgan2414/status/1897693812878684401

YouTube

Show All Videos