
The Mythos of Model Interpretability

(arXiv:1606.03490)

Published Jun 10, 2016 in cs.LG, cs.AI, cs.CV, cs.NE, and stat.ML

Abstract

Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.

Figure: standard evaluation metrics demand only predictions and ground-truth labels; when interpretability is also demanded, desiderata that those metrics leave unmet are revealed.

Overview

  • The paper examines the multifaceted concept of model interpretability in machine learning, exploring motivations such as trust, causality, transferability, informativeness, and ethical decision-making.

  • It categorizes interpretability into transparency (including simulatability, decomposability, and algorithmic transparency) and post-hoc explanations (such as text explanations, visualization, local explanations, and explanation by example).

  • The paper highlights challenges in balancing interpretability with predictive power and cautions against over-reliance on post-hoc explanations, suggesting future research directions like developing richer metrics and expanding into other machine learning paradigms.

Unpacking "The Mythos of Model Interpretability"

Introduction

When it comes to machine learning, we often hear the buzzword "interpretability." But what does it actually mean for a model to be interpretable, and why do we care? Zachary Lipton's paper, "The Mythos of Model Interpretability," dives deep into this topic, examining the various motivations behind interpretability and scrutinizing the ways we typically define and achieve it.

What Drives the Need for Interpretability?

Let's start with the "why." According to the paper, the demand for interpretability typically arises when there is a mismatch between the formal objectives our models optimize (like predictive accuracy) and the real-world goals and costs of deploying them. Here are some key motivations:

  1. Trust: Often, we need to trust that a model's predictions are reliable, especially in high-stakes applications like healthcare or criminal justice. But what does "trust" really mean here? It could mean confidence that the model performs well, or it could mean the model makes decisions in a way that's understandable and predictable.
  2. Causality: Researchers sometimes hope their models will reveal causal relationships in data. For example, a predictive model might show an association between smoking and lung cancer, prompting further investigation.
  3. Transferability: In real-world applications, conditions change. Interpretability helps us judge whether a model trained in one setting will continue to perform well when its deployment environment shifts.
  4. Informativeness: Sometimes, the purpose of a model isn't just to make accurate predictions but also to provide insights that help human decision-makers.
  5. Fair and Ethical Decision-Making: There are increasing concerns about making sure models don't perpetuate biases, especially in sensitive areas like hiring or criminal justice.

What Makes a Model Interpretable?

The paper breaks down interpretability into two broad categories: transparency and post-hoc explanations.

Transparency

  1. Simulatability: A model is transparent in this sense if a person can hold the entire model in mind and step through its calculations in a reasonable amount of time. Simple models like linear regression or small decision trees are often considered interpretable because they can be comprehended in their entirety.
  2. Decomposability: Here, each part of the model (inputs, parameters, operations) admits an intuitive explanation. For example, in a linear model, each weight can be read as the strength of association between a feature and the output (see the sketch after this list).
  3. Algorithmic Transparency: This refers to understanding how the training algorithm itself behaves. Linear models are transparent in this sense because, under standard conditions, we can prove that training converges to a unique solution, a guarantee modern deep learning methods generally lack.
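
To make the decomposability point concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (the feature names are hypothetical, chosen only for illustration), that fits a linear model and reads its weights as feature-output associations:

```python
# Minimal sketch of "decomposability": each learned weight of a linear model
# can be read as the strength (and sign) of association between one feature
# and the prediction. Assumes scikit-learn; the data and feature names are
# synthetic placeholders, not from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["age", "dose_mg", "prior_visits"]  # hypothetical features
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Each coefficient is individually inspectable -- the sense in which
# linear models are often called interpretable.
for name, weight in zip(feature_names, model.coef_):
    print(f"{name}: {weight:+.2f}")
```

This reading only holds when the features themselves are meaningful; as the paper cautions, heavily engineered or high-dimensional features can make these same weights opaque.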

Post-hoc Interpretability

  1. Text Explanations: Some models provide natural language explanations for their predictions. This helps users understand what the model is doing without needing to know its inner workings.
  2. Visualization: Techniques like t-SNE help visualize high-dimensional data in 2D, making it easier to understand what the model has learned.
  3. Local Explanations: Instead of trying to explain the entire model, some approaches focus on explaining individual predictions. For example, saliency maps highlight which parts of an input image most influenced the model's decision (see the sketch after this list).
  4. Explanation by Example: This method shows examples similar to the one being predicted, helping users understand why the model made a particular decision.
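
As an illustration of the third item, one common local-explanation technique is a gradient-based saliency map. The sketch below assumes PyTorch, and both the toy classifier and the random "image" are stand-ins rather than anything from the paper:

```python
# Minimal sketch of a gradient-based saliency map (one form of local,
# post-hoc explanation). Assumes PyTorch; the model and input are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier
model.eval()

x = torch.randn(1, 1, 28, 28, requires_grad=True)  # placeholder input image
scores = model(x)
top_class = scores.argmax(dim=1).item()

# Gradient of the winning class score with respect to the input pixels.
scores[0, top_class].backward()

# Large absolute gradients mark pixels whose perturbation most changes
# the score -- the usual reading of a saliency map.
saliency = x.grad.abs().squeeze()
print(saliency.shape)  # torch.Size([28, 28])
```

The paper's caution applies here: such a map describes the model's local behavior around one input, not its full decision process, and can mislead if read as a faithful global explanation.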

Key Takeaways

Lipton’s paper brings up some valuable points:

  1. Linear Models Aren't Always More Interpretable Than Neural Networks: While linear models are often touted as more interpretable, this isn't always the case. For instance, a deep neural network using raw features might be more understandable than a linear model relying on heavily engineered features.
  2. Be Specific About Interpretability Claims: Any claims about a model's interpretability should be qualified. What kind of interpretability is being referred to? Transparency or post-hoc explanations?
  3. Interpretability vs. Predictive Power: Sometimes, insisting on interpretability might lead us to sacrifice predictive performance. It's important to weigh these trade-offs carefully.
  4. Post-hoc Explanations Can Mislead: Relying purely on post-hoc explanations can sometimes be problematic. For example, a method optimized to produce plausible-sounding explanations might inadvertently provide misleading information.

Future Directions

Lipton suggests a few promising avenues for future research:

  1. Developing Richer Metrics: Better performance metrics and loss functions might help bridge the gap between machine learning objectives and real-world needs.
  2. Expanding to Other ML Paradigms: Investigating interpretability within the context of reinforcement learning could offer valuable insights, especially given its capacity to model interactions between algorithms and environments.

Conclusion

"The Mythos of Model Interpretability" urges the machine learning community to approach interpretability with a nuanced and critical mindset. By clearly defining what we mean by interpretability and why we need it, we can make more informed decisions about the models we deploy and ensure they meet our broader objectives.
