
Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors

Published 16 Apr 2024 in cs.CV and eess.IV | (2404.10836v1)

Abstract: The aim of this work is to establish how accurately a recent semantic-based foveal active perception model is able to complete visual tasks that are regularly performed by humans, namely, scene exploration and visual search. This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations. It has been used previously in scene exploration tasks. In this paper, we revisit the model and extend its application to visual search tasks. To illustrate the benefits of using semantic information in scene exploration and visual search tasks, we compare its performance against traditional saliency-based models. In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model in accurately representing the semantic information present in the visual scene. In visual search experiments, where the model searches for instances of a target class in a visual field containing multiple distractors, it outperforms both the saliency-driven model and a random gaze selection algorithm. Our results demonstrate that semantic information, applied top-down, significantly influences visual exploration and search tasks, suggesting its integration with traditional bottom-up cues as a promising research direction.


Summary

  • The paper introduces a semantic-based active perception model that integrates object detection and Bayesian methods to update semantic maps for efficient visual tasks.
  • It employs predictive gaze strategies to reduce the number of gaze shifts and improve accuracy in visual search and scene exploration compared to traditional saliency-based models.
  • The approach, despite higher computational demands, shows promise for achieving human-like visual cognition in complex, dynamic environments.


Introduction

The paper "Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors" introduces an advanced methodology for performing visual tasks on humanoid robots using a combination of semantic information and active perception. The study extends a semantic-based foveal active perception model to address both scene exploration and visual search tasks. By leveraging modern object detectors for identifying objects and updating a semantic scene description over successive fixations, the approach demonstrates superior performance compared to traditional saliency-based models.

Methodology Overview

The proposed methodology integrates semantic data from object detectors with active visual perception to emulate human-like visual cognition. The approach relies on foveal vision, which increases processing efficiency by keeping high resolution in central vision while progressively reducing the resolution of peripheral visual information.
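
To make the space-variant idea concrete, the sketch below illustrates eccentricity-dependent foveation in Python. It is a minimal illustration under assumed parameters (`fovea_radius`, `levels`) and a crude box-blur scheme; the paper's actual sensor model may differ (for example, log-polar sampling).

```python
import numpy as np

# Minimal sketch of space-variant (foveated) sampling, assuming a simple
# eccentricity-dependent blur as a stand-in for a foveal sensor model.
def foveate(image, fixation, fovea_radius=32, levels=4):
    """Blend progressively blurred copies of `image` by eccentricity.

    image: (H, W) grayscale array; fixation: (row, col) gaze point.
    Pixels near the fixation keep full resolution; farther pixels are
    taken from coarser (more blurred) copies.
    """
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    ecc = np.hypot(rows - fixation[0], cols - fixation[1])

    # Build crude blurred copies with box filters of growing size.
    blurred = [image.astype(float)]
    for lvl in range(1, levels):
        k = 2 * lvl + 1
        pad = np.pad(blurred[0], k // 2, mode="edge")
        out = np.zeros_like(blurred[0])
        for dr in range(k):          # simple (slow) box blur; a real
            for dc in range(k):      # system would use an image pyramid
                out += pad[dr:dr + h, dc:dc + w]
        blurred.append(out / (k * k))

    # Assign each pixel the blur level given by its eccentricity ring.
    level_idx = np.minimum(ecc // fovea_radius, levels - 1).astype(int)
    result = np.zeros_like(blurred[0])
    for lvl in range(levels):
        mask = level_idx == lvl
        result[mask] = blurred[lvl][mask]
    return result
```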

System Architecture

Figure 1: Our general methodological approach to semantic visual search and scene exploration tasks.

Object detection frameworks are employed to generate bounding boxes and classification scores for detected objects. These semantic predictions are fused into a semantic map using Bayesian methods, specifically Dirichlet distributions that account for classification uncertainty.

Figure 2: Representation of the dependencies between the variables that participate in the methodology.
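
As a rough illustration of this fusion step, the sketch below maintains a grid-structured semantic map in which each cell holds a Dirichlet concentration vector over object classes, and detector scores are added as weighted pseudo-counts. The grid layout, weighting, and function names are assumptions made for illustration, not the paper's exact update rule.

```python
import numpy as np

K = 5          # number of semantic classes (hypothetical)
GRID = (8, 8)  # coarse semantic-map resolution (hypothetical)

# Uniform Dirichlet prior: alpha = 1 for every class in every cell.
alpha = np.ones(GRID + (K,))

def update_cell(alpha, cell, class_scores, weight=1.0):
    """Fuse a detector's class scores into one cell's Dirichlet parameters.

    class_scores: length-K vector of (calibrated) classification scores.
    weight: confidence of the observation, e.g. lower in the periphery.
    """
    alpha[cell] += weight * np.asarray(class_scores)
    return alpha

def expected_class_probs(alpha, cell):
    """Posterior mean of the Dirichlet: the expected class distribution."""
    a = alpha[cell]
    return a / a.sum()

def cell_entropy(alpha, cell):
    """Entropy of the expected class distribution; high entropy = uncertain cell."""
    p = expected_class_probs(alpha, cell)
    return -np.sum(p * np.log(p + 1e-12))

# Example: a detection of class 2 with score 0.8 observed in cell (3, 4).
scores = np.array([0.05, 0.05, 0.80, 0.05, 0.05])
alpha = update_cell(alpha, (3, 4), scores, weight=1.0)
print(expected_class_probs(alpha, (3, 4)))
print(cell_entropy(alpha, (3, 4)))
```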

Active Perception Strategy

Active perception in this model involves determining the next gaze location to minimize uncertainty, measured by metrics such as Kullback-Leibler divergence or entropy. For predictive tasks like visual search, the model simulates future updates of the semantic map to predict and select the next best fixation point.
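
A hedged sketch of this selection rule is shown below, building on the Dirichlet map above: each candidate fixation is scored by the predicted reduction in total map entropy after a simulated update, and the best-scoring fixation is chosen. The `simulate_update` hook and the entropy objective are illustrative assumptions; the paper may equally well use a Kullback-Leibler criterion.

```python
import numpy as np

def total_entropy(alpha):
    """Sum of per-cell entropies of the expected class distributions."""
    p = alpha / alpha.sum(axis=-1, keepdims=True)
    return -np.sum(p * np.log(p + 1e-12))

def choose_next_fixation(alpha, candidate_fixations, simulate_update):
    """Pick the fixation with the largest predicted uncertainty reduction.

    simulate_update(alpha, fix) must return a new alpha (without modifying
    the input) reflecting the detections expected when fixating `fix` --
    a simulation of the map update, not a real observation.
    """
    current = total_entropy(alpha)
    gains = [current - total_entropy(simulate_update(alpha, f))
             for f in candidate_fixations]
    return candidate_fixations[int(np.argmax(gains))]
```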

Experimental Results

The methodology was validated through a series of experiments comparing it to traditional saliency-based models and random selection algorithms. The experiments focused on visual search and scene exploration tasks in complex visual environments with multiple distractors.

The semantic model displayed superior performance in visual search tasks compared to saliency-based models such as VOCUS2. The predictive approach significantly improved accuracy, reducing the number of gaze shifts required to locate target objects.

Figure 3: Comparison between the mean values of the cumulative performance for predictive and non-predictive approaches.

Scene Exploration

For scene exploration, the semantic model outperformed saliency-based models in accurately mapping the semantic content of scenes with fewer actions. Foveal calibration also benefited the non-predictive approach, improving its efficacy.

Figure 4: Mean values of the average success rate, indicating the advantages of the semantic-based method over the random and VOCUS2 approaches.

Computational Considerations

The computational requirements for the proposed methodology are higher than those for traditional saliency models, primarily due to the complexity of semantic map simulations and updates. However, the semantic approach provides a more accurate representation of scene content with fewer actions, compensating for its higher computational cost in applications where precision is critical.

Conclusion

The semantic-based active perception model for humanoid visual tasks offers a promising approach toward human-like visual cognition in robots. By integrating semantic information with active perception, the model effectively balances processing efficiency with task accuracy. Future work may explore further integration with top-down cognitive models and extensions to real-world mobile robotic platforms.

Overall, this research presents a robust framework for addressing complex visual tasks in dynamic environments, highlighting the potential of semantic information to enhance robotic perception systems.
