TransAVS: End-to-End Audio-Visual Segmentation with Transformer (2305.07223v2)

Published 12 May 2023 in cs.SD, cs.CV, cs.MM, and eess.AS

Abstract: Audio-Visual Segmentation (AVS) is a challenging task that aims to segment sounding objects in video frames by exploiting audio signals. AVS generally faces two key challenges: (1) audio signals are inherently information-dense, as the sounds produced by multiple objects are entangled within the same audio stream; (2) objects of the same category tend to produce similar audio signals, making them difficult to distinguish and leading to unclear segmentation results. To address these challenges, we propose TransAVS, the first Transformer-based end-to-end framework for the AVS task. Specifically, TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks by a fully transformer-based architecture. This scheme not only promotes comprehensive audio-image communication but also explicitly mines the instance cues encapsulated in the scene. Meanwhile, to encourage the audio queries to capture distinct sounding objects rather than degenerating into homogeneous representations, we devise two self-supervised loss functions at the query and mask levels, allowing the model to capture distinctive features within similar audio data and achieve more precise segmentation. Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset, highlighting its effectiveness in bridging the gap between the audio and visual modalities.
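The mechanism the abstract describes (audio queries that cross-attend to image features and decode into per-query masks, with a query-level loss that keeps the queries from collapsing) can be illustrated compactly. The PyTorch snippet below is a minimal sketch under assumed names and shapes (AudioQueryDecoder, query_diversity_loss, and a 128-dimensional clip-level audio embedding are all illustrative inventions); it is not the authors' implementation, and the mask-level loss is omitted.

# Hypothetical sketch of the idea in the TransAVS abstract: audio features
# modulate a set of learnable "audio queries" that cross-attend to image
# features and are decoded into segmentation mask logits. All module names,
# feature dimensions, and the diversity loss are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioQueryDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=16, num_layers=3, num_heads=8):
        super().__init__()
        # Learnable query embeddings, later modulated by the audio signal.
        self.query_embed = nn.Embedding(num_queries, dim)
        # Assumes a 128-d clip-level audio embedding (e.g. a VGGish-style feature).
        self.audio_proj = nn.Linear(128, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, audio_feat, img_feat, img_hw):
        # audio_feat: (B, 128) clip-level audio embedding
        # img_feat:   (B, H*W, dim) flattened features from an image encoder
        B = audio_feat.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        # "Disentangle" the audio stream into queries by adding the projected audio.
        queries = queries + self.audio_proj(audio_feat).unsqueeze(1)
        # Audio queries attend to image features (audio-image communication).
        queries = self.decoder(tgt=queries, memory=img_feat)
        # Dot-product each query embedding with per-pixel features -> mask logits.
        mask_logits = torch.einsum("bqc,bpc->bqp", self.mask_embed(queries), img_feat)
        H, W = img_hw
        return queries, mask_logits.view(B, -1, H, W)

def query_diversity_loss(queries):
    # Illustrative query-level self-supervised loss: penalize pairwise cosine
    # similarity so queries do not collapse into homogeneous representations.
    q = F.normalize(queries, dim=-1)               # (B, Q, C)
    sim = torch.einsum("bqc,bkc->bqk", q, q)       # pairwise cosine similarities
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    return off_diag.abs().mean()

if __name__ == "__main__":
    B, H, W, dim = 2, 28, 28, 256
    model = AudioQueryDecoder(dim=dim)
    audio = torch.randn(B, 128)
    image = torch.randn(B, H * W, dim)
    queries, masks = model(audio, image, (H, W))
    print(masks.shape, query_diversity_loss(queries).item())
    # torch.Size([2, 16, 28, 28]) followed by a scalar loss value

In this reading, the diversity loss penalizes pairwise cosine similarity between queries, which is one simple way to realize the abstract's stated goal of keeping the audio queries from degenerating into homogeneous representations; the paper's actual query- and mask-level losses may differ.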

Authors (6)
  1. Yuhang Ling
  2. Yuxi Li
  3. Zhenye Gan
  4. Jiangning Zhang
  5. Mingmin Chi
  6. Yabiao Wang
Citations (1)
