Contextual Encoder-Decoder Network for Visual Saliency Prediction (1902.06634v4)
Abstract: Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information to predict visual saliency accurately. Our model achieves competitive and consistent results across multiple evaluation metrics on two public saliency benchmarks, and we demonstrate the effectiveness of the suggested approach on five datasets and selected examples. Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources, such as (virtual) robotic systems, to estimate human fixations across complex natural scenes.
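The abstract describes two architectural ideas: parallel convolutions with different dilation rates for multi-scale feature extraction, and a global scene descriptor fused with those features. The sketch below illustrates how such a module could look in PyTorch; the dilation rates, channel sizes, and fusion scheme are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a multi-scale contextual module as described in the
# abstract: parallel dilated convolutions plus a global scene branch.
# All hyperparameters (dilation rates, channel counts) are assumed values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleContextModule(nn.Module):
    """Parallel dilated convolutions combined with global scene information."""

    def __init__(self, in_channels: int = 512, branch_channels: int = 128):
        super().__init__()
        # Parallel 3x3 convolutions with increasing dilation rates (assumed).
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                      padding=rate, dilation=rate)
            for rate in (1, 2, 4, 8)
        ])
        # Global scene branch: pool to 1x1, project, then upsample.
        self.global_proj = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        # Fuse the four local branches and the global branch.
        self.fuse = nn.Conv2d(branch_channels * 5, branch_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [F.relu(branch(x)) for branch in self.branches]
        g = F.adaptive_avg_pool2d(x, 1)            # scene-level descriptor
        g = F.relu(self.global_proj(g))
        g = F.interpolate(g, size=(h, w), mode="nearest")
        return F.relu(self.fuse(torch.cat(feats + [g], dim=1)))


if __name__ == "__main__":
    # Example: encoder features of shape (batch, 512, 30, 40) -> fused map.
    module = MultiScaleContextModule()
    out = module(torch.randn(1, 512, 30, 40))
    print(out.shape)  # torch.Size([1, 128, 30, 40])
```

In a full encoder-decoder saliency model, a module like this would sit between a pre-trained classification backbone and an upsampling decoder that produces the final fixation map.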
Authors: Alexander Kroner, Mario Senden, Kurt Driessens, Rainer Goebel