One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations (2405.05581v1)
Abstract: Because LLMs are nondeterministic, the same input can yield different outputs, some of which may be incorrect or hallucinated. If run again, the LLM may correct itself and produce the right answer. Unfortunately, most LLM-powered systems present a single result, which users tend to accept whether or not it is correct. Having the LLM produce multiple outputs may help surface disagreements or alternatives, but it is not obvious how users will interpret such conflicts or inconsistencies. To this end, we investigate how users perceive the AI model and comprehend the generated information when they receive multiple, potentially inconsistent, outputs. Through a preliminary study, we identified five types of output inconsistency. Based on these categories, we conducted a study (N=252) in which participants read one or more LLM-generated passages responding to an information-seeking question. We found that inconsistency among multiple LLM-generated outputs lowered participants' perceived AI capacity while increasing their comprehension of the given information. Specifically, this positive effect of inconsistency was most pronounced for participants who read two passages, compared with those who read three. Based on these findings, we present design implications: rather than treating LLM output inconsistencies as a drawback, systems can reveal potential inconsistencies to transparently indicate the limitations of these models and promote critical LLM usage.
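To make the core idea concrete, the sketch below illustrates one minimal way a system might sample the same question several times and summarize agreement among the outputs before showing them to a user. This is not the paper's study apparatus or any specific API; `sample_llm` is a hypothetical stand-in for a real, nondeterministic model call, and the simulated answers are illustrative only.

```python
import random
from collections import Counter

def sample_llm(question: str) -> str:
    """Hypothetical stand-in for a nondeterministic LLM call.

    Replace with a real model/API call; here we simulate varied answers,
    including one erroneous generation, to mimic inconsistency."""
    return random.choice([
        "The Pacific Ocean is the largest ocean.",
        "The Pacific Ocean is the largest ocean.",
        "The Atlantic Ocean is the largest ocean.",  # simulated error
    ])

def collect_outputs(question: str, k: int = 3) -> list[str]:
    """Ask the same question k times to expose possible disagreement."""
    return [sample_llm(question) for _ in range(k)]

def summarize_consistency(outputs: list[str]) -> dict:
    """Report whether the sampled outputs agree and which answer dominates."""
    counts = Counter(outputs)
    majority_answer, majority_count = counts.most_common(1)[0]
    return {
        "consistent": len(counts) == 1,
        "majority_answer": majority_answer,
        "agreement_ratio": majority_count / len(outputs),
    }

if __name__ == "__main__":
    outs = collect_outputs("Which is the largest ocean on Earth?", k=3)
    print(summarize_consistency(outs))
```

A system following the paper's design implications could present all sampled passages along with such an agreement summary, rather than hiding disagreement behind a single answer.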