Modality Prompts for Arbitrary Modality Salient Object Detection (2405.03351v1)

Published 6 May 2024 in cs.CV

Abstract: This paper delves into the task of arbitrary modality salient object detection (AM SOD), which aims to detect salient objects from inputs of arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) is proposed to address two fundamental challenges of AM SOD: the more diverse modality discrepancies caused by the varying modality types that must be processed, and the need for a dynamic fusion design caused by the uncertain number of modalities present in the inputs to the multimodal fusion strategy. Specifically, inspired by prompt learning's ability to align the distributions of pre-trained models with the characteristics of downstream tasks by learning a few prompts, MAT first presents a modality-adaptive feature extractor (MAFE) that tackles the diverse modality discrepancies by introducing a modality prompt for each modality. In the training stage, a new modality translation contractive (MTC) loss is further designed to help MAFE learn modality-distinguishable prompts. Accordingly, in the testing stage, MAFE can employ those learned modality prompts to adaptively adjust its feature space according to the characteristics of the input modalities, and thus extract discriminative unimodal features. Then, MAT presents a channel-wise and spatial-wise fusion hybrid (CSFH) strategy to meet the demand for dynamic fusion. To that end, CSFH dedicates a channel-wise dynamic fusion module (CDFM) and a novel spatial-wise dynamic fusion module (SDFM) to fusing the unimodal features from varying numbers of modalities, while effectively capturing cross-modal complementary semantic and detail information, respectively. Moreover, CSFH carefully aligns CDFM and SDFM with different levels of unimodal features based on their characteristics, for more effective exploitation of complementary information.
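The abstract describes two mechanisms: a shared feature extractor whose feature space is steered by a learnable prompt per modality, and a fusion step that must accept a variable number of unimodal feature maps. The following is a minimal, hypothetical PyTorch sketch of both ideas. The class names, the small convolutional backbone (the paper's MAFE is Transformer-based), the additive prompt injection, and the sigmoid-gated channel fusion are all simplifying assumptions for illustration, not the paper's actual MAT/MAFE/CDFM implementation.

```python
import torch
import torch.nn as nn

class ModalityPromptExtractor(nn.Module):
    """Hypothetical simplification of MAFE: a shared backbone whose
    feature space is adapted by one learned prompt per modality."""
    def __init__(self, modalities=("rgb", "depth", "thermal"), dim=64):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a Transformer backbone
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        # One learnable prompt tensor per supported modality.
        self.prompts = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, dim, 1, 1)) for m in modalities}
        )

    def forward(self, x, modality):
        feat = self.backbone(x)
        # Shift the shared feature space toward the input modality.
        return feat + self.prompts[modality]

class ChannelWiseDynamicFusion(nn.Module):
    """Sketch of a CDFM-style module: fuses unimodal features from a
    *variable* number of modalities via per-channel gates."""
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )

    def forward(self, feats):  # feats: list of (B, C, H, W) tensors
        weights = [self.gate(f) for f in feats]           # per-channel gates
        fused = sum(w * f for w, f in zip(weights, feats))
        return fused / len(feats)                         # average over inputs

# Usage: two modalities here, but any number of inputs works.
extractor = ModalityPromptExtractor()
fusion = ChannelWiseDynamicFusion()
rgb = torch.randn(2, 3, 32, 32)
depth = torch.randn(2, 3, 32, 32)  # assumes depth replicated to 3 channels
feats = [extractor(rgb, "rgb"), extractor(depth, "depth")]
out = fusion(feats)
```

Taking a list of feature maps rather than a fixed-arity signature is what keeps the fusion module agnostic to how many modalities arrive at test time, which is the core requirement the abstract attributes to CSFH.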
