Manipulating Feature Visualizations with Gradient Slingshots (2401.06122v3)
Abstract: Feature Visualization (FV) is a widely used technique for interpreting the concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. In this paper, we introduce a novel method, Gradient Slingshots, that enables manipulation of FV without modifying the model architecture or significantly degrading its performance. By shaping new trajectories in the off-distribution regions of the activation landscape of a feature, we coerce the optimization process to converge on a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
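For context, FV of the kind the attack targets is typically computed by gradient ascent on the input: starting from noise, the input is iteratively updated to maximize a chosen unit's activation. Below is a minimal sketch of this standard activation-maximization loop in PyTorch; the model, layer index, channel choice, step count, and learning rate are all illustrative assumptions, not the paper's exact setup, and the real attack itself (fine-tuning the model so this loop converges on an attacker-chosen image) is not reproduced here.

```python
# Minimal activation-maximization FV sketch (assumed setup, not the
# paper's code). Requires torch and torchvision.
import torch
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1").eval()

# Capture the activation of a chosen layer with a forward hook.
activations = {}
def hook(_module, _inputs, output):
    activations["feat"] = output

layer = model.features[28]   # hypothetical layer choice (a conv layer)
handle = layer.register_forward_hook(hook)

unit = 0                     # hypothetical channel to visualize
x = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([x], lr=0.05)

for _ in range(256):
    optimizer.zero_grad()
    model(x)
    # Gradient ascent: maximize the mean activation of the target
    # channel. Gradient Slingshots works by reshaping this objective's
    # off-distribution landscape so that the very same loop is pulled
    # toward a predefined, attacker-chosen visualization.
    loss = -activations["feat"][0, unit].mean()
    loss.backward()
    optimizer.step()

handle.remove()
# x now holds the synthesized visualization (unregularized, for brevity).
```

Practical FV implementations add regularizers and image parameterizations on top of this loop, but the attack surface is the same: whatever steers this optimization determines what an auditor sees.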