Why Does ChatGPT Fall Short in Providing Truthful Answers? (2304.10513v3)
Abstract: Recent advancements in LLMs, such as ChatGPT, have demonstrated significant potential to impact various aspects of human life. However, ChatGPT still struggles to provide reliable and accurate answers to user questions. To better understand the model's particular weaknesses in providing truthful answers, we embark on an in-depth exploration of open-domain question answering. Specifically, we undertake a detailed examination of ChatGPT's failures, which we categorize into four types: comprehension, factuality, specificity, and inference. We further pinpoint factuality as the largest contributor to these failures and identify two critical abilities associated with factuality: knowledge memorization and knowledge recall. Through experiments focusing on factuality, we propose several potential enhancement strategies. Our findings suggest that augmenting the model with granular external knowledge and with cues for knowledge recall can enhance its factuality in answering questions.
Authors: Shen Zheng, Jie Huang, Kevin Chen-Chuan Chang