Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization (2403.03095v1)
Abstract: Audio-Visual Source Localization (AVSL) is the task of identifying specific sounding objects in a scene given audio cues. In our work, we focus on semi-supervised AVSL with pseudo-labeling. To address the issues of vanilla hard pseudo-labels, including bias accumulation, noise sensitivity, and instability, we propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn from each other via a cross-refine mechanism to avoid bias accumulation. We equip XPL with two effective components. First, soft pseudo-labels with sharpening and an exponential moving average mechanism enable the models to improve gradually and ensure stable training. Second, a curriculum data selection module adaptively selects high-quality pseudo-labels during training to mitigate potential bias. Experimental results demonstrate that XPL significantly outperforms existing methods, achieving state-of-the-art performance while effectively mitigating confirmation bias and ensuring training stability. A minimal sketch of these mechanisms follows below.
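A minimal PyTorch-style sketch of the ideas described in the abstract: temperature sharpening of soft pseudo-labels, an exponential moving average (EMA) update for stability, and cross supervision where each model learns from the other's pseudo-labels with a confidence filter standing in for curriculum data selection. All function names, the temperature `T`, the momentum `m`, and the threshold `thresh` are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def sharpen(p: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    """Raise probabilities to 1/T and renormalize; smaller T gives peakier labels."""
    p = p.clamp_min(1e-8) ** (1.0 / T)
    return p / p.sum(dim=-1, keepdim=True)


def ema_update(label: torch.Tensor, new_p: torch.Tensor,
               m: float = 0.9, T: float = 0.5) -> torch.Tensor:
    """EMA of sharpened soft pseudo-labels, so labels change gradually across steps."""
    return m * label + (1.0 - m) * sharpen(new_p, T)


def cross_pl_loss(logits_a: torch.Tensor, logits_b: torch.Tensor,
                  pl_a: torch.Tensor, pl_b: torch.Tensor,
                  thresh: float = 0.7) -> torch.Tensor:
    """Each model is trained on the other model's pseudo-labels (cross-refine),
    keeping only confident labels as a simple proxy for curriculum selection."""
    def masked_ce(logits, target):
        mask = (target.max(dim=-1).values >= thresh).float()
        ce = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1)
        return (ce * mask).sum() / mask.sum().clamp_min(1.0)

    # Model A learns from B's pseudo-labels and vice versa.
    return masked_ce(logits_a, pl_b) + masked_ce(logits_b, pl_a)
```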