RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models (2401.00396v2)
Abstract: Retrieval-augmented generation (RAG) has become a mainstream technique for alleviating hallucinations in LLMs. Even with RAG in place, however, LLMs may still produce claims that are unsupported by, or contradict, the retrieved contents. To develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations across various domains and tasks within standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotation at both the individual-case and word levels, including assessments of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methodologies. Furthermore, we show that with a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve a competitive level of performance in hallucination detection compared to existing prompt-based approaches using state-of-the-art LLMs such as GPT-4.
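The abstract's last claim concerns finetuning a relatively small LLM on RAGTruth to detect hallucinations. The sketch below is an illustration only, not the authors' released pipeline: it shows one plausible setup using LoRA adapters on an open 7B model as a response-level hallucination classifier. The field names (`reference`, `response`, `label`), the data file names, the model checkpoint, and all hyperparameters are assumptions for the sake of the example.

```python
# Minimal sketch (assumptions throughout, not the RAGTruth authors' code):
# LoRA-finetune a small open LLM to classify whether a RAG response
# contains a hallucination with respect to its retrieved reference.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # illustrative choice of small LLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2, torch_dtype=torch.bfloat16)
model.config.pad_token_id = tokenizer.pad_token_id

# Wrap the base model with LoRA adapters so only a small fraction of
# parameters is trained.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="SEQ_CLS")
model = get_peft_model(model, lora)

def preprocess(example):
    # Concatenate the retrieved reference and the generated response so the
    # classifier can check the response against its evidence.
    text = (f"Reference:\n{example['reference']}\n\n"
            f"Response:\n{example['response']}")
    enc = tokenizer(text, truncation=True, max_length=2048)
    enc["labels"] = int(example["label"])  # 1 = response contains hallucination
    return enc

# Hypothetical local JSON files derived from the RAGTruth annotations.
data = load_dataset("json", data_files={"train": "ragtruth_train.json",
                                        "eval": "ragtruth_dev.json"})
data = data.map(preprocess, remove_columns=data["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ragtruth-detector",
                           per_device_train_batch_size=4,
                           num_train_epochs=3,
                           learning_rate=1e-4,
                           bf16=True,
                           logging_steps=50),
    train_dataset=data["train"],
    eval_dataset=data["eval"],
    tokenizer=tokenizer,  # enables padding collation for variable-length inputs
)
trainer.train()
```

A response-level classifier is only one possible framing; the corpus's word-level spans would equally support a token-tagging formulation, which this sketch does not attempt.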
Authors: Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Cheng Niu, Randy Zhong, Juntong Song, Tong Zhang