Designing for Appropriate Reliance: The Roles of AI Uncertainty Presentation, Initial User Decision, and User Demographics in AI-Assisted Decision-Making (2401.05612v2)
Abstract: Appropriate reliance is critical to achieving synergistic human-AI collaboration. For instance, when users over-rely on AI assistance, their human-AI team performance is bounded by the model's capability. This work studies how the presentation of model uncertainty may steer users' decision-making toward fostering appropriate reliance. Our results demonstrate that showing the calibrated model uncertainty alone is inadequate. Rather, calibrating model uncertainty and presenting it in a frequency format allow users to adjust their reliance accordingly and help reduce the effect of confirmation bias on their decisions. Furthermore, the critical nature of our skin cancer screening task skews participants' judgment, causing their reliance to vary depending on their initial decision. Additionally, step-wise multiple regression analyses revealed how user demographics such as age and familiarity with probability and statistics influence human-AI collaborative decision-making. We discuss the potential for model uncertainty presentation, initial user decision, and user demographics to be incorporated in designing personalized AI aids for appropriate reliance.
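As a rough illustration of the two presentation styles the abstract contrasts, the sketch below renders the same calibrated confidence value as a percentage and as a frequency statement. The function names, the wording of the messages, and the reference size of 100 are assumptions made for this example; they are not taken from the study's interface.

```python
def to_percentage_format(confidence: float) -> str:
    """Render a confidence value in [0, 1] as a percentage statement."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return f"The model is {confidence:.0%} confident in this prediction."


def to_frequency_format(confidence: float, reference_size: int = 100) -> str:
    """Render the same confidence as a frequency statement.

    reference_size is a hypothetical number of comparable cases used
    to phrase the statement; 100 is an assumed default.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    # Round to a whole number of cases so the statement reads naturally.
    expected_correct = round(confidence * reference_size)
    return (
        f"Out of {reference_size} cases like this one, the model is "
        f"expected to be correct in about {expected_correct}."
    )


if __name__ == "__main__":
    calibrated_confidence = 0.9  # assumed output of a calibrated model
    print(to_percentage_format(calibrated_confidence))
    print(to_frequency_format(calibrated_confidence))
```

Both functions assume the input has already been calibrated; a frequency statement built from an uncalibrated score would convey a misleading count.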
Authors: Shiye Cao, Anqi Liu, Chien-Ming Huang