The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models (2401.03205v1)
Abstract: In the era of LLMs, hallucination (i.e., the tendency to generate factually incorrect content) poses a great challenge to the trustworthy and reliable deployment of LLMs in real-world applications. To tackle LLM hallucination, three key questions should be well studied: how to detect hallucinations (detection), why LLMs hallucinate (source), and what can be done to mitigate them (mitigation). To address these challenges, this work presents a systematic empirical study on LLM hallucination, focused on the three aspects of hallucination detection, source, and mitigation. Specifically, we construct a new hallucination benchmark, HaluEval 2.0, and design a simple yet effective detection method for LLM hallucination. Furthermore, we zoom into the different training and utilization stages of LLMs and extensively analyze the potential factors that lead to LLM hallucination. Finally, we implement and examine a series of widely used techniques to mitigate hallucinations in LLMs. Our work yields several important findings for understanding the origins of hallucination and for mitigating hallucinations in LLMs. Our code and data can be accessed at https://github.com/RUCAIBox/HaluEval-2.0.
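The abstract does not spell out how the detection method works; as a rough illustration of what a claim-level hallucination check can look like, the sketch below extracts factual statements from a model response and verifies each one against reference evidence. All names here (the `chat` helper, the prompts, the scoring function) are hypothetical placeholders for illustration and are not the method proposed in the paper.

```python
# Illustrative sketch only: a generic two-step hallucination check
# (extract factual claims, then verify each claim against evidence).
# The `chat` helper and prompt wording are assumptions, not the paper's method.

from typing import List


def chat(prompt: str) -> str:
    """Placeholder for a call to any chat-style LLM API."""
    raise NotImplementedError("plug in your preferred LLM client here")


def extract_claims(response: str) -> List[str]:
    # Ask the model to break a response into atomic factual statements.
    out = chat(
        "List each factual claim in the following text, one per line:\n" + response
    )
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]


def verify_claim(claim: str, evidence: str) -> bool:
    # Ask the model whether the evidence supports the claim.
    verdict = chat(
        f"Evidence:\n{evidence}\n\nClaim: {claim}\n"
        "Answer strictly 'supported' or 'unsupported'."
    )
    return verdict.strip().lower().startswith("supported")


def hallucination_rate(response: str, evidence: str) -> float:
    # Fraction of extracted claims that the evidence does not support.
    claims = extract_claims(response)
    if not claims:
        return 0.0
    unsupported = sum(not verify_claim(c, evidence) for c in claims)
    return unsupported / len(claims)
```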
Authors: Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen