
On Early Detection of Hallucinations in Factual Question Answering (2312.14183v3)

Published 19 Dec 2023 in cs.CL and cs.AI

Abstract: While LLMs have taken great strides towards helping humans with a plethora of tasks, hallucinations remain a major impediment towards gaining user trust. The fluency and coherence of model generations even when hallucinating makes detection a difficult task. In this work, we explore if the artifacts associated with the model generations can provide hints that the generation will contain hallucinations. Specifically, we probe LLMs at 1) the inputs via Integrated Gradients based token attribution, 2) the outputs via the Softmax probabilities, and 3) the internal state via self-attention and fully-connected layer activations for signs of hallucinations on open-ended question answering tasks. Our results show that the distributions of these artifacts tend to differ between hallucinated and non-hallucinated generations. Building on this insight, we train binary classifiers that use these artifacts as input features to classify model generations into hallucinations and non-hallucinations. These hallucination classifiers achieve up to $0.80$ AUROC. We also show that tokens preceding a hallucination can already predict the subsequent hallucination even before it occurs.


Summary

  • The paper introduces binary classifiers that use Softmax probabilities, Integrated Gradients attributions, self-attention scores, and fully-connected activations as features to detect hallucinations in LLM outputs, achieving up to 0.82 AUROC.
  • It leverages internal model artifacts to differentiate between accurate and hallucinatory responses by analyzing confidence, token attribution, and activation clustering.
  • Experimental results across datasets like TriviaQA reveal that self-attention and fully-connected activations are most effective for early detection of hallucinated answers.

On Early Detection of Hallucinations in Factual Question Answering

Introduction

The paper "On Early Detection of Hallucinations in Factual Question Answering" (2312.14183) explores the challenge of detecting hallucinations in responses generated by LLMs. The fluency of these models can produce coherent text even when they provide incorrect factual information, termed as hallucinations. The work investigates the feasibility of early detection of hallucinations by analyzing artifacts derived from various stages of the LLM's question-answering pipeline.

Artifacts for Hallucination Detection

The paper systematically analyzes several types of artifacts associated with LLM generations: Softmax probabilities, Integrated Gradients (IG) feature attributions, self-attention scores, and fully-connected layer activations. These artifacts capture different aspects of the model's internal state and output behavior:

Softmax Probabilities

The Softmax probabilities at the output level are a direct measure of model confidence. The paper posits that hallucinations tend to correlate with higher entropy values in the Softmax probability distributions, suggesting a lower certainty in the model's predictions.

Figure 1: Softmax probability distributions for hallucinated versus non-hallucinated responses, highlighting differences in confidence.
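
To make the signal concrete, the sketch below computes per-token Softmax entropy from the scores returned during greedy decoding with a Hugging Face causal LM. The OPT checkpoint, prompt, and decoding settings are illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper studies OpenLLaMA, OPT, and Falcon variants.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.eval()

prompt = "Question: In which year was the Eiffel Tower completed?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Keep the per-step scores so the Softmax entropy of each generated token is available.
out = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)

entropies = []
for step_logits in out.scores:  # one logits tensor per generated token
    probs = torch.softmax(step_logits[0], dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    entropies.append(entropy.item())

# Higher (average) entropy is the kind of low-confidence signal the paper
# associates with a greater likelihood of hallucination.
print("per-token entropy:", [round(e, 3) for e in entropies])
print("mean entropy:", sum(entropies) / len(entropies))
```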

Integrated Gradients Attributions

IG attributions provide insights into which input tokens are deemed important for the generation of specific output tokens. The paper hypothesizes that hallucinated outputs exhibit dispersed attribution scores over the input tokens rather than focusing on key tokens, indicating potential uncertainty or irrelevance.

Figure 2: IG attributions showing differences in token importance between hallucinated and accurate generations.
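
For intuition, here is a minimal hand-rolled sketch of Integrated Gradients over the input token embeddings, approximated with a Riemann sum (the paper's references indicate Captum was used). The checkpoint, the zero-embedding baseline, and the chosen target token are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.eval()

prompt = "Question: Who wrote 'Pride and Prejudice'?\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Hypothetical target: the first token of the answer being attributed.
target_id = tokenizer(" Jane", add_special_tokens=False).input_ids[0]

embed = model.get_input_embeddings()
x = embed(input_ids)             # input token embeddings
baseline = torch.zeros_like(x)   # zero-embedding baseline (an assumption)

steps = 32
total_grads = torch.zeros_like(x)
for alpha in torch.linspace(0.0, 1.0, steps):
    interp = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
    logits = model(inputs_embeds=interp).logits
    # Log-probability of the target answer token at the final input position.
    logprob = torch.log_softmax(logits[0, -1], dim=-1)[target_id]
    total_grads += torch.autograd.grad(logprob, interp)[0]

# Riemann-sum approximation of IG, summed over the embedding dimension
# to obtain a single attribution score per input token.
ig = ((x - baseline) * total_grads / steps).sum(dim=-1)[0]
attributions = ig / ig.abs().sum()

# Dispersed attributions over the input tokens (rather than a few dominant ones)
# are the pattern the paper associates with hallucinated generations.
for token, score in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), attributions):
    print(f"{token:>15s}  {score.item():+.4f}")
```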

Self-Attention Scores and Fully-Connected Activations

Self-attention scores and fully-connected activations offer a window into the model's internal processing. The paper shows that these activations differ measurably between hallucinated and non-hallucinated outputs, particularly in the deeper layers of the Transformer, and that these differences can be leveraged to identify hallucinations.

Figure 3: t-SNE clustering of self-attention scores differentiating hallucinated from non-hallucinated outputs.
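
A rough sketch of how such internal activations might be extracted and inspected is given below. The choice of layer, the pooling over the final prompt token, and the two-dimensional t-SNE projection are assumptions rather than the paper's exact configuration:

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.eval()

def deep_layer_activation(prompt: str) -> np.ndarray:
    """Hidden state of the final prompt token at the last layer (pooling choice is an assumption)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, output_attentions=True)
    # out.hidden_states = (embeddings, layer_1, ..., layer_N); out.attentions holds
    # the self-attention scores, which can be pooled in a similar way.
    return out.hidden_states[-1][0, -1].numpy()

prompts = [
    "Question: What is the capital of France?\nAnswer:",
    "Question: Who discovered penicillin?\nAnswer:",
    "Question: What year did the Berlin Wall fall?\nAnswer:",
    "Question: Who painted the Mona Lisa?\nAnswer:",
]
labels = [0, 1, 0, 1]  # hypothetical hallucination labels (1 = hallucinated)

features = np.stack([deep_layer_activation(p) for p in prompts])

# Project to 2-D; with enough labeled generations, hallucinated and
# non-hallucinated examples tend to form distinguishable clusters.
coords = TSNE(n_components=2, perplexity=3, init="random", random_state=0).fit_transform(features)
for (u, v), label in zip(coords, labels):
    print(f"label={label}  x={u:+.2f}  y={v:+.2f}")
```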

Methodology

The authors develop binary classifiers trained on the aforementioned artifacts to categorize model outputs into hallucinations and non-hallucinations. These classifiers, built using different combinations of the artifacts, achieve AUROC scores as high as 0.82 for certain datasets and models, demonstrating significant potential for early hallucination detection. The classifiers highlight the effectiveness of self-attention and fully-connected activation scores in discerning hallucinations across various dataset types.
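
The general recipe can be sketched as follows: fit a binary classifier on artifact-derived features and report AUROC on held-out generations. The logistic-regression model and the random placeholder features are assumptions for illustration, not the paper's classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder artifact features: one row per model generation, e.g. a pooled
# self-attention or fully-connected activation vector (random values stand in
# for real artifacts here).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))
y = rng.integers(0, 2, size=1000)  # 1 = hallucinated, 0 = non-hallucinated

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# AUROC is the metric reported in the paper (up to roughly 0.8 for
# activation-based features); random features yield about 0.5 here.
print("AUROC:", roc_auc_score(y_test, scores))
```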

Experimental Setup

The evaluation spans several datasets, including T-REx with subject-specific questions and the broader TriviaQA dataset. The models tested include variants of OpenLLaMA, OPT, and Falcon, covering a range of parameter sizes and architectures. Each experiment assesses both the base question-answering task and the hallucination detection performance of the artifacts.
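
One plausible way to obtain hallucination labels for such datasets is to compare each generated answer against the reference answers (for example, TriviaQA's answer aliases). The normalization and matching heuristics below are assumptions, not necessarily the paper's exact protocol:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_hallucination(generated_answer: str, gold_aliases: list) -> bool:
    """Label a generation as hallucinated if it matches none of the reference answers."""
    pred = normalize(generated_answer)
    return not any(
        normalize(alias) in pred or pred in normalize(alias)
        for alias in gold_aliases
    )

# TriviaQA provides several acceptable aliases per question.
print(is_hallucination("It was Jane Austen.", ["Jane Austen", "Austen"]))  # False
print(is_hallucination("Charles Dickens wrote it.", ["Jane Austen"]))      # True
```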

Results

Qualitative findings emphasize the discernible differences in entropy and clustering between artifacts generated from hallucinated versus correct responses. Quantitatively, self-attention scores and fully-connected activations consistently outperform IG attributions and Softmax probabilities in hallucination detection, achieving over 0.70 AUROC across most conditions.

Figure 4: AUROC scores for different hallucination detectors using self-attention and fully-connected activations.

Conclusion

The demonstrated efficacy of early hallucination detection through model artifacts paves the way for enhanced reliability in LLM applications. Real-world implementations could leverage these classifiers to flag potential inaccuracies in model outputs before reaching the user, contributing to increased trust and applicability in critical domains such as web search and information retrieval. Future work may explore combining artifact-based detection with advanced retrieval mechanisms and fine-tuning strategies for broader applicability and robustness.
