
Compositional Explanations of Neurons

(2006.14032)
Published Jun 24, 2020 in cs.LG, cs.AI, cs.CL, cs.CV, and stat.ML

Abstract

We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior. Compared to prior work that uses atomic labels as explanations, analyzing neurons compositionally allows us to more precisely and expressively characterize their behavior. We use this procedure to answer several questions on interpretability in models for vision and natural language processing. First, we examine the kinds of abstractions learned by neurons. In image classification, we find that many neurons learn highly abstract but semantically coherent visual concepts, while other polysemantic neurons detect multiple unrelated features; in natural language inference (NLI), neurons learn shallow lexical heuristics from dataset biases. Second, we see whether compositional explanations give us insight into model performance: vision neurons that detect human-interpretable concepts are positively correlated with task performance, while NLI neurons that fire for shallow heuristics are negatively correlated with task performance. Finally, we show how compositional explanations provide an accessible way for end users to produce simple "copy-paste" adversarial examples that change model behavior in predictable ways.

Figure: Explanation generation using beam search to maximize IoU score with binary masks and primitive concepts.

Overview

  • The paper introduces a method to explain neuron behavior in deep learning models using logical concepts, offering insights into the internal workings beyond simple labels.

  • Their approach converts neuron activations into binary masks and builds logical forms that match neuron activation patterns via a beam search over compositions of primitive concepts.

  • The method is applied to image classification and natural language inference tasks, demonstrating varying correlations between neuron interpretability and model performance.

Explaining Neurons in Deep Networks with Logical Concepts

Introduction

Deep learning models, especially those used in computer vision and NLP, often behave like black boxes. To tackle this, researchers have been trying to explain the inner workings of these models. The paper we're discussing takes a direct approach: it searches for logical concepts that closely approximate each neuron's behavior. By understanding these concepts, we get a clearer picture of what different neurons are doing. Notably, these explanations go beyond simple atomic labels and take the form of compositional logical formulas, which characterize neuron behavior more precisely and expressively.

Generating Compositional Explanations

The core idea is to explain neuron behavior using logical forms built from basic concepts. Imagine a neuron that activates for dog images. It might also activate for related but distinct concepts, so a single atomic label like "dog detector" would misrepresent it; a compositional logical formula can capture the fuller pattern.

Here's how the process works:

  • Input and Activation: Take a set of inputs and find which ones activate the neuron.
  • Binary Masks: Convert these activations into binary masks (active/inactive).
  • Logic Formation: Build logical forms (like dog OR cat) incrementally, combining primitive concepts with AND, OR, and NOT.
  • Search for Best Match: Run a beam search over these logical forms to find the one whose mask best overlaps the neuron's activation pattern, scored by intersection-over-union (IoU).

This compositional search yields explanations that are much richer than those of earlier methods, which relied on simple, atomic labels.
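To make the procedure concrete, here is a minimal, self-contained sketch of the IoU scoring and beam search described above. It is not the authors' implementation: the toy masks, the concept set, the beam width, and the maximum formula length are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: one neuron's binarized activation mask and a few primitive
# concept masks, all defined over the same set of positions (e.g. spatial
# locations across a probing dataset).
N = 10_000
neuron_mask = rng.random(N) < 0.10
concepts = {
    "dog": rng.random(N) < 0.08,
    "cat": rng.random(N) < 0.05,
    "grass": rng.random(N) < 0.20,
    "water": rng.random(N) < 0.10,
}

def iou(a, b):
    """Intersection-over-union between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def explain(neuron_mask, concepts, max_len=3, beam_size=5):
    """Beam-search logical forms (AND / OR / AND NOT) that maximize IoU
    with the neuron's mask, growing formulas one primitive at a time."""
    # Start the beam from single primitives, scored by IoU.
    beam = sorted(
        ((iou(neuron_mask, m), name, m) for name, m in concepts.items()),
        key=lambda t: -t[0],
    )[:beam_size]
    for _ in range(max_len - 1):
        candidates = list(beam)  # shorter formulas stay in the running
        for _score, form, mask in beam:
            for name, cmask in concepts.items():
                for op, new_mask in (("AND", mask & cmask),
                                     ("OR", mask | cmask),
                                     ("AND NOT", mask & ~cmask)):
                    candidates.append(
                        (iou(neuron_mask, new_mask), f"({form} {op} {name})", new_mask)
                    )
        beam = sorted(candidates, key=lambda t: -t[0])[:beam_size]
    best_score, best_form, _ = beam[0]
    return best_form, best_score

formula, score = explain(neuron_mask, concepts)
print(f"best explanation: {formula}  (IoU = {score:.3f})")
```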

Tasks: Vision and NLP

The method is applied to two tasks to showcase its versatility:

  1. Image Classification: Using a ResNet-18 trained on the Places365 dataset, they probed neurons in the final convolutional layer, with the Broden dataset providing pixel-level concept annotations (a minimal probing sketch follows this list).
  2. Natural Language Inference (NLI): Examined using an LSTM-based model trained on the SNLI dataset. They probed neurons in the penultimate MLP layer, using primitive concepts such as word presence and part-of-speech features.
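For the vision setting, the probing step boils down to recording activations at the chosen layer and thresholding each neuron into a binary mask. The sketch below shows one way to do that with a torchvision ResNet-18 and a forward hook; the random input batch, the layer choice, and the 0.995 quantile threshold are illustrative assumptions, not the paper's exact setup (which probes a Places365-trained model and aligns activations with Broden's pixel-level annotations).

```python
import torch
import torchvision

# Untrained ResNet-18 as a stand-in for the Places365 model probed in the paper.
model = torchvision.models.resnet18(weights=None).eval()

# Capture activations of the final convolutional block with a forward hook.
activations = {}
model.layer4.register_forward_hook(
    lambda _module, _inputs, output: activations.__setitem__("layer4", output.detach())
)

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)  # stand-in for a batch of probing images
    model(images)

feats = activations["layer4"]                     # shape (8, 512, 7, 7)
# Flatten to (neurons, positions) and binarize each neuron at a high
# activation quantile (0.995 here is an assumed threshold).
flat = feats.permute(1, 0, 2, 3).reshape(feats.shape[1], -1)
thresholds = torch.quantile(flat, 0.995, dim=1, keepdim=True)
binary_masks = flat > thresholds                  # one boolean mask per neuron
print(binary_masks.shape, binary_masks.float().mean().item())
```

Each row of binary_masks can then be compared against concept masks with the IoU-guided search sketched earlier.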

Results: Neuron Learning and Model Accuracy

Types of Learned Concepts:

  • Vision Models: Some neurons learned highly abstract, semantically coherent concepts (e.g., "tall structures"), while others were polysemantic, reacting to seemingly unrelated features (e.g., "river or house").
  • NLI Models: Neurons often captured shallow heuristics rooted in dataset biases. For instance, a neuron might fire whenever the premise contains the word "man," a lexical shortcut that reflects biases (including gender biases) picked up from the training data rather than genuine inference.

Performance Insights:

  • Vision: Neurons that detect human-interpretable concepts were positively correlated with task performance: the better a neuron could be explained, the more it tended to go along with correct predictions.
  • NLI: The opposite trend appeared. Neurons that were easy to explain often encoded shallow, spurious heuristics and were negatively correlated with task performance (a toy version of this correlation analysis is sketched below).
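The reported correlations are, at heart, a comparison of two per-neuron quantities: how well a neuron can be explained (its best IoU) and how the model performs when that neuron is involved. The snippet below is a toy version of that analysis with synthetic numbers; measuring accuracy over the inputs where a neuron fires is one plausible operationalization, not necessarily the paper's exact protocol.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-ins: best-explanation IoU per neuron, and model accuracy
# over the inputs on which each neuron fires. Real values would come from
# the probing and explanation steps above.
neuron_iou = rng.random(512)
acc_when_active = 0.55 + 0.25 * neuron_iou + rng.normal(0.0, 0.05, size=512)

r, p = pearsonr(neuron_iou, acc_when_active)
print(f"Pearson r = {r:.2f}  (p = {p:.1e})")
# The paper reports this relationship as positive for the vision model and
# negative for the NLI model.
```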

Practical Applications: Adversarial Examples

One exciting aspect is how these explanations can be used to craft adversarial examples:

  • Vision: By identifying which concepts a class-relevant neuron fires for, researchers could copy-paste image regions containing those concepts into a scene to change the model's prediction in a predictable way.
  • NLI: Adding certain words to the input sentences could steer the model into a wrong classification, showing the potential for both diagnosing and exploiting learned biases (see the sketch below).
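For the NLI case, the attack itself is nothing more than inserting a word that a neuron explanation flags as a strong label cue and checking whether the prediction flips. The helper below is purely illustrative; predict, nli_predict, and the example sentences are hypothetical stand-ins rather than the paper's interface or exact examples.

```python
def copy_paste_attack(predict, premise, hypothesis, trigger_word):
    """Append a trigger word suggested by a neuron explanation to the
    hypothesis and report the prediction before and after.

    predict is a hypothetical wrapper: predict(premise, hypothesis) -> label.
    """
    before = predict(premise, hypothesis)
    after = predict(premise, hypothesis + " " + trigger_word)
    return before, after

# Hypothetical usage with a trained SNLI model wrapped as nli_predict:
# before, after = copy_paste_attack(
#     nli_predict,
#     "A man is playing a guitar on stage.",
#     "A man is performing music.",
#     "nobody",  # a word strongly associated with the contradiction label in SNLI
# )
# print(before, "->", after)  # if the heuristic holds, the label flips
```

Because the explanation already tells us which words the relevant neurons key on, the resulting change in prediction is predictable rather than a blind perturbation.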

Implications and Future Developments

Compositional explanations provide much-needed insight into what deep networks are learning. This has several implications:

  • Diagnosis and Debugging: Understanding neuron behavior can help diagnose model errors and biases.
  • Model Improvement: Could inform regularization techniques to encourage neurons to learn more robust, interpretable features.
  • Adversarial Defense: Understanding how simple, human-interpretable adversarial examples arise might inform defenses against more complex attacks.

Conclusion

Explaining neurons in deep networks with logical forms opens up a powerful tool for interpretability. It sheds light on both the strengths and weaknesses of the model, explaining behavior more comprehensively than simple labels. The resulting insights can guide improvements in model design, training, and application, making this a valuable approach for advancing AI interpretability.
