Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations (2403.18167v2)
Abstract: State-of-the-art LMs sometimes generate non-factual hallucinations that misalign with world knowledge. To explore the mechanistic causes of these hallucinations, we create diagnostic datasets with subject-relation queries and adapt interpretability methods to trace hallucinations through internal model representations. We discover two general and distinct mechanistic causes of hallucinations shared across LMs (Llama-2, Pythia, GPT-J): 1) knowledge enrichment hallucinations: insufficient subject attribute knowledge in lower-layer MLPs, and 2) answer extraction hallucinations: failure to select the correct object attribute in upper-layer attention heads. We also find that these two internal mechanistic causes of hallucinations are reflected in external manifestations. Based on insights from our mechanistic analysis, we propose a novel hallucination mitigation method through targeted restoration of the LM's internal fact recall pipeline, demonstrating superior performance compared to baselines.