
Contextual Encoder-Decoder Network for Visual Saliency Prediction (1902.06634v4)

Published 18 Feb 2019 in cs.CV

Abstract: Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information for accurately predicting visual saliency. Our model achieves competitive and consistent results across multiple evaluation metrics on two public saliency benchmarks and we demonstrate the effectiveness of the suggested approach on five datasets and selected examples. Compared to state of the art approaches, the network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources, such as (virtual) robotic systems, to estimate human fixations across complex natural scenes.

Authors (4)
  1. Alexander Kroner
  2. Mario Senden
  3. Kurt Driessens
  4. Rainer Goebel
Citations (179)

Summary

  • The paper presents a CNN-based encoder-decoder architecture with ASPP that integrates multi-scale and contextual features for improved saliency prediction.
  • The methodology employs dilated convolutions and reduced downsampling to preserve spatial details, enabling accurate mapping of human visual attention.
  • Evaluation on datasets like MIT1003 and CAT2000 shows competitive results with fewer parameters, highlighting its potential for real-time applications.

Contextual Encoder-Decoder Network for Visual Saliency Prediction

The paper "Contextual Encoder-Decoder Network for Visual Saliency Prediction" introduces a deep-learning approach to predicting visual saliency. The authors propose a convolutional neural network (CNN) architecture that combines multi-scale feature extraction with contextual information to predict salient regions in natural images. The work addresses a core difficulty in modeling human fixation patterns: they are driven by both low-level visual features and high-level semantics.

Methodology and Architecture

The proposed architecture is an encoder-decoder network built on a VGG16-based image encoder that preserves spatial detail by reducing the amount of downsampling. Striding is removed from the deeper pooling layers, and the subsequent convolutions are dilated so that their receptive fields remain large despite the higher output resolution. Activations from multiple encoder levels are concatenated, so both mid- and high-level feature responses contribute to the saliency prediction.
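As a concrete illustration, the following PyTorch sketch derives such an encoder from VGG16 by neutralizing the stride of the last two pooling stages and dilating the convolutions that follow them. This is a minimal sketch of the idea, not the authors' implementation; the layer indices refer to torchvision's VGG16 definition, and pretrained weights are omitted for brevity.

```python
# Illustrative dilated VGG16 encoder (sketch, not the authors' code).
import torch
import torch.nn as nn
from torchvision.models import vgg16

def make_dilated_vgg16() -> nn.Sequential:
    features = vgg16(weights=None).features  # pretrained weights omitted here
    # In torchvision's VGG16, max-pool layers sit at indices 4, 9, 16, 23, 30.
    # Replace the last two with stride-1 pools so resolution is kept.
    for idx in (23, 30):
        features[idx] = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    # Dilate the block-5 convolutions behind the modified pool to retain
    # their original receptive field at the higher resolution.
    for idx in (24, 26, 28):
        conv = features[idx]
        conv.dilation = (2, 2)
        conv.padding = (2, 2)
    return features

encoder = make_dilated_vgg16()
x = torch.randn(1, 3, 240, 320)
print(encoder(x).shape)  # (1, 512, 30, 40): downsampled 8x instead of 32x
```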

A notable addition is an Atrous Spatial Pyramid Pooling (ASPP) module, which captures multi-scale information through parallel convolutional layers with increasing dilation rates. A global average pooling branch injects scene-level context, which helps the network reason about the spatial arrangement of objects and their relations within a scene.
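A compact sketch of such an ASPP-style module with a global-context branch is shown below; the dilation rates and channel widths are illustrative assumptions rather than the paper's exact configuration.

```python
# ASPP-style module with a global-context branch (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int = 512, out_ch: int = 256, rates=(4, 8, 12)):
        super().__init__()
        # One 1x1 branch plus several dilated 3x3 branches in parallel.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        # Global-context branch: average-pool to 1x1, then project.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [F.relu(b(x)) for b in self.branches]
        g = F.relu(self.global_branch(x))
        feats.append(g.expand(-1, -1, h, w))  # broadcast scene context
        return F.relu(self.project(torch.cat(feats, dim=1)))

aspp = ASPP()
print(aspp(torch.randn(1, 512, 30, 40)).shape)  # (1, 256, 30, 40)
```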

The decoder restores the input resolution through successive upsampling and convolution layers, yielding saliency maps that approximate the distribution of human gaze. Training minimizes the Kullback-Leibler divergence between the predicted map and the ground-truth fixation map, with both treated as probability distributions over image locations.
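The objective can be sketched as follows: both maps are normalized to sum to one before the divergence is computed. Variable names and the epsilon smoothing are my assumptions for a stable, self-contained example.

```python
# KL-divergence training objective between saliency maps (sketch).
import torch

def kld_loss(pred: torch.Tensor, target: torch.Tensor,
             eps: float = 1e-7) -> torch.Tensor:
    """pred, target: non-negative saliency maps of shape (N, H, W)."""
    # Normalize each map to a probability distribution over pixels.
    p = pred / (pred.sum(dim=(1, 2), keepdim=True) + eps)
    t = target / (target.sum(dim=(1, 2), keepdim=True) + eps)
    # KL(t || p): penalizes probability mass that the prediction misses.
    return (t * torch.log(t / (p + eps) + eps)).sum(dim=(1, 2)).mean()

loss = kld_loss(torch.rand(4, 60, 80), torch.rand(4, 60, 80))
print(loss.item())
```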

Results and Evaluation

Quantitative evaluation on multiple datasets, including MIT1003 and CAT2000, shows that the proposed model performs competitively with state-of-the-art approaches on metrics such as AUC-Judd, sAUC, and KLD. At the same time, the architecture uses markedly fewer trainable parameters than models built on deeper backbones, making it attractive for real-time use and for systems with limited computational capacity.

Qualitative analyses show that the model prioritizes semantically meaningful image regions, outperforming models that rely solely on low-level feature contrasts, and underscore the importance of contextual and multi-scale processing for replicating human-like attention.

Implications and Future Directions

The research demonstrates an effective strategy for visual saliency prediction by integrating multi-scale and contextual cue processing into a unified network. The approach highlights the potential for deploying such models in applications like virtual reality and autonomous robotics, where attention prediction can improve interaction fidelity and environmental understanding. The model's lightweight structure further appeals to real-time applications with stringent resource constraints.

Future research could integrate stronger object-recognition backbones to improve robustness in complex scenes and to better capture implicit gaze cues. Porting the architecture to other pretrained backbones would also test its flexibility and could further improve semantic feature extraction.

In conclusion, the paper contributes a robust approach to saliency modeling that fuses deep representation learning with multi-scale, context-driven processing. It addresses inherent challenges in accurately predicting human visual attention and paves the way for continued advances in both cognitive science and computer vision.
