
Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization (2403.03095v1)

Published 5 Mar 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Audio-Visual Source Localization (AVSL) is the task of identifying specific sounding objects in the scene given audio cues. In our work, we focus on semi-supervised AVSL with pseudo-labeling. To address the issues with vanilla hard pseudo-labels including bias accumulation, noise sensitivity, and instability, we propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn from each other with the cross-refine mechanism to avoid bias accumulation. We equip XPL with two effective components. Firstly, the soft pseudo-labels with sharpening and pseudo-label exponential moving average mechanisms enable models to achieve gradual self-improvement and ensure stable training. Secondly, the curriculum data selection module adaptively selects pseudo-labels with high quality during training to mitigate potential bias. Experimental results demonstrate that XPL significantly outperforms existing methods, achieving state-of-the-art performance while effectively mitigating confirmation bias and ensuring training stability.
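The abstract names four ingredients: sharpened soft pseudo-labels, a pseudo-label exponential moving average, curriculum data selection, and cross-refinement between two models. The sketch below illustrates one plausible reading of each in plain Python. It is not the authors' implementation: the abstract gives no formulas, so the function names, the temperature-based sharpening rule, the momentum value, and the top-fraction selection criterion are all assumptions made for illustration. Pseudo-labels are treated here as small normalized score vectors (for example, a flattened localization map).

```python
from typing import List, Sequence, Tuple


def sharpen(probs: Sequence[float], temperature: float = 0.5) -> List[float]:
    """Assumed sharpening rule: raise each score to 1/T, then renormalize.

    With T < 1 the largest entries gain mass, giving a softer alternative
    to a hard (one-hot) pseudo-label.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def ema_update(previous: Sequence[float], current: Sequence[float],
               momentum: float = 0.9) -> List[float]:
    """Exponential moving average of a pseudo-label across training steps.

    The momentum value is a hypothetical default, not taken from the paper.
    """
    return [momentum * a + (1.0 - momentum) * b
            for a, b in zip(previous, current)]


def select_top_fraction(confidences: Sequence[float],
                        keep_ratio: float) -> List[int]:
    """Assumed curriculum step: keep the most confident pseudo-labels.

    A training loop could raise keep_ratio over epochs so that more
    unlabeled samples are admitted as the models improve.
    """
    n_keep = max(1, int(len(confidences) * keep_ratio))
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    return sorted(order[:n_keep])


def cross_targets(pred_a: Sequence[float],
                  pred_b: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Cross supervision: each model is trained on the other's sharpened output."""
    target_for_a = sharpen(pred_b)
    target_for_b = sharpen(pred_a)
    return target_for_a, target_for_b
```

The point of the cross step is that neither model is trained on its own predictions, which is how the abstract motivates avoiding bias accumulation. Everything else about how the two models are built and scheduled is left open here.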

