"How do I fool you?": Manipulating User Trust via Misleading Black Box Explanations (1911.06473v1)

Published 15 Nov 2019 in cs.AI

Abstract: As machine learning black boxes are increasingly being deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a human interpretable manner. It has recently become apparent that a high-fidelity explanation of a black box ML model may not accurately reflect the biases in the black box. As a consequence, explanations have the potential to mislead human users into trusting a problematic black box. In this work, we rigorously explore the notion of misleading explanations and how they influence user trust in black-box models. More specifically, we propose a novel theoretical framework for understanding and generating misleading explanations, and carry out a user study with domain experts to demonstrate how these explanations can be used to mislead users. Our work is the first to empirically establish how user trust in black box models can be manipulated via misleading explanations.

Citations (233)

View on Semantic Scholar

Summary

The paper presents a theoretical framework revealing how high-fidelity explanations may hide underlying biases in ML models.
The study reveals that misleading explanations can boost user trust by up to 9.8 times in a real-world bail prediction scenario.
The research extends the MUSE framework to generate selective explanations, highlighting the need for robust and ethical interpretability methods.

Misleading Explanations in Black Box Machine Learning Models

The paper "How do I fool you?": Manipulating User Trust via Misleading Black Box Explanations, explores the notion of misleading explanations in the context of black box ML models. It highlights the critical issue that arises when high-fidelity explanations, which are supposed to elucidate the functioning of complex ML models, potentially mislead users regarding the trustworthiness of these models. The authors present a novel theoretical framework to understand and generate such misleading explanations and support their findings with empirical data derived from a user study involving domain experts.

Theoretical Framework

The theoretical framework posited by the researchers challenges the conventional metrics of explanation fidelity, which typically prioritize reproducing the output of the black box model. The authors argue that a high-fidelity explanation might not capture the underlying biases or errors present in the model. This happens primarily due to correlations among input features that allow explanations to achieve high fidelity without truly reflecting the problematic aspects of the model.

Empirical Evidence and User Study

The paper's key empirical contribution is a study conducted with law and criminal justice experts, utilizing a scenario focused on bail decision predictions. The study found that misleading explanations, specifically designed to align with users' expectations (i.e., including features they deemed relevant while omitting those seen as problematic), significantly increased trust in the model. For instance, participants were 9.8 times more likely to trust the machine when shown misleading explanations that omitted sensitive features like race and gender.

Generating Misleading Explanations

The authors extend the Model Understanding through Subspace Explanations (MUSE) framework to generate explanations meant to mislead. By fine-tuning the MUSE parameters, they ensured that explanations included desired features while excluding prohibited ones, without altering the underlying black box model. This approach allowed them to empirically demonstrate how easy it is to manipulate user trust in these systems.

Implications and Future Directions

The findings from this work underscore the potential for ML systems to inadvertently or deliberately mislead users, especially in high-stakes domains like healthcare or criminal justice. The research raises questions about the ethics and regulation of machine learning explainability. It advocates for caution and suggests that more robust methodologies, perhaps incorporating causal inference techniques, are necessary to ensure explanations do not mislead.

Future research directions could involve developing interactive explanation systems where users can query different aspects of a model's decision-making process, thus reducing the risk of being misled. Additionally, the exploration of fundamental techniques in causal explanation and robust interpretability is critical to enhance trust in AI systems.

In summary, this study shines a light on significant vulnerabilities in current ML explanation techniques and calls for a paradigm shift that accounts for and addresses the risk of misleading stakeholders through deceptively simple explanations. As AI systems become more ingrained in decision-making processes, ensuring the integrity and accuracy of their explanatory mechanisms will be paramount.