
Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization (2403.03095v1)

Published 5 Mar 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Audio-Visual Source Localization (AVSL) is the task of identifying specific sounding objects in the scene given audio cues. In our work, we focus on semi-supervised AVSL with pseudo-labeling. To address the issues with vanilla hard pseudo-labels including bias accumulation, noise sensitivity, and instability, we propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn from each other with the cross-refine mechanism to avoid bias accumulation. We equip XPL with two effective components. Firstly, the soft pseudo-labels with sharpening and pseudo-label exponential moving average mechanisms enable models to achieve gradual self-improvement and ensure stable training. Secondly, the curriculum data selection module adaptively selects pseudo-labels with high quality during training to mitigate potential bias. Experimental results demonstrate that XPL significantly outperforms existing methods, achieving state-of-the-art performance while effectively mitigating confirmation bias and ensuring training stability.
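The abstract names three mechanisms (sharpened soft pseudo-labels, a pseudo-label exponential moving average, and cross-refinement between two models) without giving formulas. The following is a minimal sketch of how they could fit together, not the authors' implementation. It assumes temperature sharpening over a normalized score vector and a fixed EMA momentum; the function names, the `temperature` and `momentum` values, and the flat-list representation of a localization map are all hypothetical choices made for illustration.

```python
def sharpen(probs, temperature=0.5):
    """Temperature sharpening (assumed form): raise to 1/T, then renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    if total == 0:
        raise ValueError("cannot sharpen an all-zero vector")
    return [p / total for p in powered]


def ema_update(previous, current, momentum=0.9):
    """Exponential moving average of pseudo-labels across training steps."""
    return [momentum * a + (1.0 - momentum) * b
            for a, b in zip(previous, current)]


def cross_targets(pred_a, pred_b, ema_a, ema_b, temperature=0.5, momentum=0.9):
    """Build each model's soft target from the OTHER model's prediction.

    ema_a / ema_b hold the running pseudo-labels used to supervise
    model A / model B; both are hypothetical bookkeeping for this sketch.
    """
    target_for_a = ema_update(ema_a, sharpen(pred_b, temperature), momentum)
    target_for_b = ema_update(ema_b, sharpen(pred_a, temperature), momentum)
    return target_for_a, target_for_b


if __name__ == "__main__":
    # Toy two-cell "map": model B is mildly confident, model A is undecided.
    ta, tb = cross_targets([0.5, 0.5], [0.6, 0.4], [0.5, 0.5], [0.5, 0.5])
    print(ta, tb)
```

With a temperature below 1, sharpening pushes mass toward the larger entries while the target stays soft, and the EMA damps step-to-step changes in that target. A curriculum selection step, which the abstract also mentions, would sit on top of this by keeping only the pseudo-labels judged reliable at a given stage of training; it is omitted here because the abstract does not state its selection criterion.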
