Visualizing and Understanding Contrastive Learning (2206.09753v3)
Abstract: Contrastive learning has revolutionized the field of computer vision, learning rich representations from unlabeled data that generalize well to diverse vision tasks. Consequently, it has become increasingly important to explain these approaches and understand their inner working mechanisms. Because contrastive models are trained with interdependent and interacting inputs and aim to learn invariance through data augmentation, existing methods for explaining single-image systems (e.g., image classification models) are inadequate: they typically assume independent inputs and fail to account for these factors. Additionally, there is a lack of evaluation metrics designed to assess pairs of explanations, and no analytical studies have investigated the effectiveness of different techniques for explaining contrastive learning. In this work, we design visual explanation methods that contribute towards understanding similarity learning tasks from pairs of images. We further adapt existing metrics, used to evaluate visual explanations of image classification systems, to suit pairs of explanations, and evaluate our proposed methods with these metrics. Finally, we present a thorough analysis of visual explainability methods for contrastive learning, establish their correlation with downstream tasks, and demonstrate the potential of our approaches to investigate their merits and drawbacks.
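To make the setting concrete, the sketch below illustrates one generic way a pairwise explanation can be produced: back-propagating the cosine similarity between the embeddings of two views to the input pixels, yielding one saliency map per image. The choice of encoder, the helper function, and the gradient aggregation are illustrative assumptions for this sketch, not the methods proposed in the paper.

```python
# Illustrative sketch (not the paper's method): gradient-based saliency for a
# similarity score between two images, using a generic PyTorch backbone.
import torch
import torch.nn.functional as F
from torchvision import models

encoder = models.resnet50(weights=None)
encoder.fc = torch.nn.Identity()  # use the backbone as an embedding model
encoder.eval()

def pairwise_saliency(x1, x2):
    """Return one saliency map per image by back-propagating the cosine
    similarity of the two embeddings to the input pixels."""
    x1 = x1.clone().requires_grad_(True)
    x2 = x2.clone().requires_grad_(True)
    z1, z2 = encoder(x1), encoder(x2)
    sim = F.cosine_similarity(z1, z2, dim=-1).sum()
    sim.backward()
    # aggregate absolute gradients over channels to obtain a 2-D heatmap per image
    return x1.grad.abs().max(dim=1).values, x2.grad.abs().max(dim=1).values

# usage: two augmented views of the same image (random tensors as stand-ins)
view1, view2 = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
s1, s2 = pairwise_saliency(view1, view2)
```

Unlike single-image attribution, both maps depend on both inputs through the shared similarity score, which is the interdependence the abstract highlights.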