TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space (2402.17811v2)
Abstract: Large language models (LLMs) sometimes produce hallucinations; in particular, they may generate untruthful responses even when they possess the correct knowledge. Activating the truthfulness within an LLM is the key to fully unlocking its knowledge potential. In this paper, we propose TruthX, an inference-time intervention method that activates the truthfulness of an LLM by identifying and editing the features within its internal representations that govern truthfulness. TruthX employs an auto-encoder to map the LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing the LLM's internal representations in the truthful space, TruthX effectively enhances the model's truthfulness. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can steer an LLM toward truthful or hallucinatory responses by editing only a single vector in its internal representations.
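To make the mechanism described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of the idea: an auto-encoder maps an LLM hidden state into a semantic latent and a truthful latent, the truthful latent is shifted along a learned truthful direction, and the result is decoded back into the hidden-state space. All module names, dimensions, and the editing-strength parameter `alpha` are illustrative assumptions; the paper's actual architecture, layer selection, and contrastive training objective are not reproduced here.

```python
# Hedged sketch of TruthX-style representation editing, assuming simple linear
# encoders/decoder and a single learned truthful direction per latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TruthXEditor(nn.Module):
    def __init__(self, hidden_dim: int = 4096, latent_dim: int = 1024):
        super().__init__()
        # Two parallel encoders: one for semantic content, one for truthfulness.
        self.semantic_enc = nn.Linear(hidden_dim, latent_dim)
        self.truthful_enc = nn.Linear(hidden_dim, latent_dim)
        # Decoder maps the combined latents back to the LLM's hidden-state space.
        self.decoder = nn.Linear(2 * latent_dim, hidden_dim)
        # Truthful editing direction; in the paper this would come from
        # contrastive learning on truthful vs. hallucinatory representations.
        self.truth_direction = nn.Parameter(torch.randn(latent_dim))

    def forward(self, hidden: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        """Edit LLM hidden states `hidden` (batch, hidden_dim) toward truthfulness."""
        sem = self.semantic_enc(hidden)
        tru = self.truthful_enc(hidden)
        # Shift only the truthful latent along the unit-normalized direction;
        # a negative alpha would instead push toward hallucinatory behavior.
        direction = F.normalize(self.truth_direction, dim=-1)
        tru_edited = tru + alpha * direction
        return self.decoder(torch.cat([sem, tru_edited], dim=-1))


if __name__ == "__main__":
    editor = TruthXEditor(hidden_dim=4096, latent_dim=1024)
    h = torch.randn(2, 4096)           # stand-in for one layer's hidden states
    h_truthful = editor(h, alpha=1.0)  # edited states fed back into the LLM layer
    print(h_truthful.shape)            # torch.Size([2, 4096])
```

In this reading, the single parameter vector `truth_direction` corresponds to the abstract's claim that editing "only one vector" in the internal representations suffices to control truthful versus hallucinatory behavior.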
- Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes.
- Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics.
- Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
- Robustness of edited neural networks. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
- Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations.
- Truth forest: Toward multi-scale truthfulness in large language models through intervention without tuning.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- DoLa: Decoding by contrasting layers improves factuality in large language models.
- Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.
- Chain-of-verification reduces hallucination in large language models.
- GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space.
- Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. In Thirty-seventh Conference on Neural Information Processing Systems.
- Inspecting and editing knowledge representations in language models.
- Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).
- Mistral 7B.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
- Language models (mostly) know what they know.
- SH2: Self-highlighted hesitation helps you decode more truthfully.
- Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations.
- Inference-time intervention: Eliciting truthful answers from a language model.
- Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, Toronto, Canada. Association for Computational Linguistics.
- TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
- Second thoughts are best: Learning to re-align with human values from text edits. In Advances in Neural Information Processing Systems, volume 35, pages 181–196. Curran Associates, Inc.
- Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.
- Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems.
- Generating benchmarks for factuality evaluation of language models. arXiv preprint arXiv:2307.06908.
- OpenAI. 2022. Introducing ChatGPT.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
- Self-critiquing models for assisting human evaluators.
- Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
- Extracting latent steering vectors from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 566–581, Dublin, Ireland. Association for Computational Linguistics.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- What makes for good views for contrastive learning? In Advances in Neural Information Processing Systems, volume 33, pages 6827–6839. Curran Associates, Inc.
- LLaMA: Open and efficient foundation language models.
- Llama 2: Open foundation and fine-tuned chat models.
- Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Auto-encoder based dimensionality reduction. Neurocomputing, 184:232–242.
- Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
- BayLing: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv preprint arXiv:2306.10968.
- Alleviating hallucinations of large language models through induced hallucinations.
- Siren’s song in the AI ocean: A survey on hallucination in large language models.
- Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. Just Accepted.
- Representation engineering: A top-down approach to AI transparency.