Neural Activation Patterns (NAPs): Visual Explainability of Learned Concepts

Published 20 Jun 2022 in cs.LG and cs.AI | (2206.10611v1)

Abstract: A key to deciphering the inner workings of neural networks is understanding what a model has learned. Promising methods for discovering learned features are based on analyzing activation values, whereby current techniques focus on analyzing high activation values to reveal interesting features on a neuron level. However, analyzing high activation values limits layer-level concept discovery. We present a method that instead takes into account the entire activation distribution. By extracting similar activation profiles within the high-dimensional activation space of a neural network layer, we find groups of inputs that are treated similarly. These input groups represent neural activation patterns (NAPs) and can be used to visualize and interpret learned layer concepts. We release a framework with which NAPs can be extracted from pre-trained models and provide a visual introspection tool that can be used to analyze NAPs. We tested our method with a variety of networks and show how it complements existing methods for analyzing neural network activation values.




Summary

The paper proposes Neural Activation Patterns (NAPs) to capture layer-level learned concepts by clustering entire activation vectors.
It introduces a method combining normalization, spatial aggregation, and HDBSCAN clustering to efficiently extract and visualize these patterns.
The approach offers practical insights into concept evolution, data bias detection, and model comparison across various architectures.

The paper "Neural Activation Patterns (NAPs): Visual Explainability of Learned Concepts" (2206.10611) introduces a method for understanding what concepts a neural network layer has learned by analyzing the patterns formed by the layer's activations across a set of inputs. Unlike methods that focus on maximizing or analyzing activations of individual neurons, NAPs consider the entire activation vector for each input, representing how the input activates the layer as a whole. Inputs with similar activation profiles are grouped together, forming a Neural Activation Pattern (NAP). These NAPs are proposed as representations of learned concepts at the layer level.

The practical implementation of extracting NAPs involves several steps:

Activation Extraction: For a given layer $l$ and a set of inputs $X$, the layer's activation output is recorded for each input $x_i$. This yields a high-dimensional matrix $A^l = n^l(X)$, where $n^l$ is the sub-network up to layer $l$ and each row of $A^l$ is the activation vector of one input. The input data can be a training set, a test set, or any other relevant dataset.
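
To make the notation $A^l = n^l(X)$ concrete, here is a minimal sketch using a toy fully connected network in NumPy. The network, its random weights, and the helper name `sub_network` are illustrative assumptions. They are not part of the paper's package, which operates on pre-trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: three dense layers with ReLU and random weights.
layer_sizes = [8, 16, 12, 4]
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def sub_network(X, l):
    """Return A^l: the activations of layer l (1-indexed) for every row of X."""
    a = X
    for W, b in zip(weights[:l], biases[:l]):
        a = np.maximum(a @ W + b, 0.0)  # dense layer followed by ReLU
    return a

X = rng.normal(size=(100, 8))  # 100 inputs with 8 features each
A2 = sub_network(X, 2)         # activation matrix at layer 2
print(A2.shape)                # (100, 12): one row per input, one column per unit
```

With a real framework, the same matrix would typically be obtained by evaluating a truncated copy of the model or by registering a hook on the layer of interest.
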
Optional Location Disentanglement: For convolutional layers, the activation output for each unit is a spatial matrix. To identify concepts regardless of their location in the input (e.g., an object being in the top or bottom of an image), the spatial information can be removed by aggregating the activation values across spatial dimensions. The paper explores several aggregation methods:
- Peak feature strength: Uses the maximum activation value in the spatial matrix.
- Feature range: Uses both the minimum and maximum activation values.
- Feature amount: Uses the average activation value.
- Feature amount and spread: Uses the average and standard deviation.

The study found that the two average-based methods (feature amount, and feature amount and spread) generally yield more extracted NAPs and better separation of concepts than peak feature strength, feature range, or no aggregation at all.
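
The four aggregation options can be sketched in NumPy as follows. This assumes channels-last activations of shape `(n_inputs, height, width, channels)`. The function name `aggregate_spatial` and the method keys are made up for this illustration and are not taken from the paper's package.

```python
import numpy as np

def aggregate_spatial(acts, method="mean"):
    """Collapse (n_inputs, height, width, channels) activations over space."""
    axes = (1, 2)  # spatial dimensions, assuming a channels-last layout
    if method == "max":        # peak feature strength
        return acts.max(axis=axes)
    if method == "min_max":    # feature range
        return np.concatenate([acts.min(axis=axes), acts.max(axis=axes)], axis=1)
    if method == "mean":       # feature amount
        return acts.mean(axis=axes)
    if method == "mean_std":   # feature amount and spread
        return np.concatenate([acts.mean(axis=axes), acts.std(axis=axes)], axis=1)
    raise ValueError(f"unknown method: {method}")

acts = np.random.default_rng(0).random((32, 7, 7, 64))
print(aggregate_spatial(acts, "mean").shape)      # (32, 64)
print(aggregate_spatial(acts, "mean_std").shape)  # (32, 128)
```

The two-statistic variants double the number of columns per input, so the later clustering step operates in a space twice as wide.
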
Activation Unit Equalization: Different units within a layer can produce activation values on vastly different scales. To keep units with larger activation ranges from disproportionately influencing pattern detection, each unit's activations are divided by that unit's maximum absolute activation across the entire input set. This preserves the meaning of zero and the relationship between positive and negative activations while scaling values into the range $[-1, 1]$, or into $[0, 1]$ when the layer's activations are non-negative (for example, after a ReLU).

$\hat{a}^l_{i,u} = \frac{n^l_u(x_i)}{\max\left(\lvert n^l_u(X) \rvert\right)}$
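
A minimal NumPy sketch of this per-unit normalization is shown below. The function name `equalize_units` is hypothetical, and the guard for units that are zero on every input is an added assumption, since the formula is undefined in that case.

```python
import numpy as np

def equalize_units(A, eps=1e-12):
    """Divide each column (unit) by its maximum absolute activation."""
    scale = np.abs(A).max(axis=0, keepdims=True)
    # Assumption: units that never activate are left as zeros instead of dividing by 0.
    scale = np.where(scale < eps, 1.0, scale)
    return A / scale

A = np.array([[ 2.0, 0.0, 100.0],
              [-4.0, 0.0,  50.0],
              [ 1.0, 0.0,  25.0]])
A_hat = equalize_units(A)
print(A_hat)
# Rows become [0.5, 0, 1], [-1, 0, 0.5], [0.25, 0, 0.25]
```

After this step the third unit, whose raw values are far larger than the first unit's, no longer dominates distance computations between inputs.
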
Activation Clustering: The core step is to group the normalized activation vectors based on their similarity. This is treated as a clustering problem in the high-dimensional activation space. The paper highlights the need for a clustering algorithm that requires few parameters, is computationally efficient, and can handle noise. HDBSCAN [McInnes2017] is chosen because it determines the number of clusters automatically and performs well. Inputs clustered together are considered to share a Neural Activation Pattern, representing a potential learned concept. Clusters are taken from the leaves of the cluster tree rather than from higher levels, which yields more fine-grained concepts.

The paper provides a practical implementation consisting of a Python package for extracting NAPs from pre-trained models and a web-based visualization tool called the "NAP Magnifying Glass" for analyzing them.

NAP Extraction Package: The Python package allows users to input a model, data, and specify layers to extract NAPs. It handles the steps of activation extraction, optional disentanglement, normalization, and clustering (using HDBSCAN with default or configurable parameters). To manage memory for large models/datasets, activations can be cached to disk.
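
How the package caches activations is not described here, so the following is only a sketch of one way to do it with NumPy: write batches into a memory-mapped `.npy` file and reopen it read-only. The function name `cache_activations`, the file name, and the batch sizes are hypothetical.

```python
import os
import tempfile
import numpy as np

def cache_activations(batches, path, n_inputs, n_units):
    """Stream activation batches into an on-disk .npy file, then reopen it read-only."""
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(n_inputs, n_units)
    )
    start = 0
    for batch in batches:
        out[start:start + len(batch)] = batch  # only one batch is held in RAM at a time
        start += len(batch)
    out.flush()
    del out
    return np.load(path, mmap_mode="r")

rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layer_acts.npy")
    batches = (rng.random((25, 64), dtype=np.float32) for _ in range(4))
    cached = cache_activations(batches, path, n_inputs=100, n_units=64)
    shape = cached.shape
    # Per-unit maxima can be computed without loading the file eagerly.
    col_max = np.array(np.abs(cached).max(axis=0))
    del cached  # release the memory map before the directory is removed
print(shape, col_max.shape)  # (100, 64) (64,)
```
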
NAP Magnifying Glass: This interactive visualization interface facilitates the exploration and interpretation of extracted NAPs. It includes:
- Layer Overview: Displays NAPs for a selected layer, showing representative images from each pattern, sorted by cluster persistence. Metadata like predictions or labels can be shown and used for filtering.
- Compare View: Allows side-by-side inspection of selected NAPs from potentially different models or layers, displaying all associated images and their average unit activation profiles.
- Image View: Enables tracing a single image's journey through the network by showing all NAPs it belongs to across different layers, arranged sequentially.
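
The quantities behind the Compare View and Image View can be sketched as plain NumPy computations. The helper names `nap_profiles` and `trace_image`, the layer names, and the tiny example data are assumptions for illustration and do not reflect the tool's actual code.

```python
import numpy as np

def nap_profiles(A_hat, labels):
    """Average unit activation profile for each NAP (noise label -1 is skipped)."""
    return {int(k): A_hat[labels == k].mean(axis=0) for k in np.unique(labels) if k >= 0}

def trace_image(index, labels_per_layer):
    """NAP id of one input in every layer, in layer order; None where it is noise."""
    return {name: (int(lab[index]) if lab[index] >= 0 else None)
            for name, lab in labels_per_layer.items()}

A_hat = np.array([[0.9, 0.1], [0.7, 0.3], [0.1, 0.8], [0.2, 1.0]])
labels = np.array([0, 0, 1, 1])
profiles = nap_profiles(A_hat, labels)
print(profiles)  # NAP 0 averages to about [0.8, 0.2], NAP 1 to about [0.15, 0.9]

labels_per_layer = {"conv1": np.array([0, 0, 1, -1]), "dense1": labels}
trace = trace_image(3, labels_per_layer)
print(trace)     # {'conv1': None, 'dense1': 1}
```
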

The practical application of NAPs is demonstrated through qualitative results on various models: small networks trained on MNIST and CIFAR10, as well as ResNet50, ResNet50 V2, and Inception V3. Key insights gained include:

Concept Differentiation in Early Layers: Even a simple MNIST model shows NAPs in early layers that distinguish different styles of input (e.g., handwritten '1' with different tilts).
Tracing Concept Evolution: Following images across layers shows how the network builds up complex concepts from simpler ones (e.g., generic animals -> animal heads -> a specific class such as dogs in CIFAR10; objects in nature -> vehicles -> moving van in ResNet50).
Identifying Data Bias: NAPs can reveal biases in the training data. For example, a NAP in a late layer of ResNet50 V2 consists solely of images of people holding fish, suggesting the model might have learned the context (a person holding something) rather than the fish itself.
Model Comparison: Comparing NAPs across architectures (ResNet50, ResNet50 V2, Inception V3) for the same concept (e.g., the Indigo Bunting bird class) can show how different models build up that concept. Such differences may correlate with performance differences, with more advanced models representing the concept across more layers.

The paper notes that NAPs are not a standalone solution but a complementary tool for layer-level understanding. They require human interpretation, and the discovered NAPs depend on the input data used. NAPs from earlier layers may be harder to interpret because they capture abstract, low-level features. Despite these limitations, NAPs offer a unique perspective by focusing on the collective behavior of layer units across inputs, providing insights into the learned concepts and potentially revealing issues such as data bias or differences between model architectures. Future work could explore slicing activations in different ways (e.g., per class or per unit), applying NAPs to other data types and tasks (beyond images and classification), and combining NAPs with attribution methods to better highlight the input features relevant to a pattern.
