Frequency-aware convolution for sound event detection (2403.13252v2)
Abstract: In sound event detection (SED), convolutional neural networks (CNNs) are widely employed to extract time-frequency (TF) patterns from spectrograms. However, because convolution is translation-equivariant, CNNs are insensitive to shifts of TF patterns along the frequency dimension, which limits their ability to distinguish different sound events. To address this issue, a method called frequency dynamic convolution (FDY) has been proposed, which applies distinct convolution kernels to different frequency components. However, FDY requires significantly more parameters and computation than a standard CNN. This paper proposes a more efficient alternative called frequency-aware convolution (FAC). FAC encodes frequency positional information in a vector and explicitly adds it to the input spectrogram. To match the amplitude of the encoding vector to that of the input spectrogram, the vector is scaled adaptively and channel-dependently using self-attention. To evaluate the effectiveness of FAC, we conducted experiments in the context of DCASE 2023 task 4. The results show that FAC achieves performance comparable to FDY while requiring only 515 additional parameters, whereas FDY requires an additional 8.02 million. An ablation study further confirms that the adaptive, channel-dependent scaling of the encoding vector is critical to the performance of FAC.
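The abstract gives enough detail to sketch FAC's forward pass. The PyTorch snippet below is a minimal illustration under stated assumptions: the frequency positional encoding is modeled as a learnable vector over frequency bins, and the adaptive, channel-dependent scaling is realized as a squeeze-excitation-style gate (global pooling, a 1×1 convolution, and a sigmoid). Neither design choice is specified in the abstract, and the names `FrequencyAwareConv` and `n_freq_bins` are hypothetical.

```python
import torch
import torch.nn as nn

class FrequencyAwareConv(nn.Module):
    """Sketch of frequency-aware convolution (FAC): a frequency positional
    encoding is scaled per channel and added to the input spectrogram
    before an ordinary convolution."""

    def __init__(self, in_channels, out_channels, n_freq_bins, kernel_size=3):
        super().__init__()
        # Assumed encoding scheme: one learnable value per frequency bin.
        self.freq_encoding = nn.Parameter(torch.randn(n_freq_bins))
        # Channel-dependent scale: global average pooling followed by a
        # 1x1 convolution and a sigmoid yields one gate value per channel.
        self.scale = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                            # (B, C, 1, 1)
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, channels, freq, time) spectrogram features.
        s = self.scale(x)                           # (B, C, 1, 1)
        enc = self.freq_encoding.view(1, 1, -1, 1)  # (1, 1, F, 1)
        # Scaling keeps the encoding's amplitude commensurate with the
        # input before the explicit addition described in the abstract.
        return self.conv(x + s * enc)

# Usage on a batch of log-mel-like features:
x = torch.randn(4, 16, 128, 250)                  # (batch, channels, freq, time)
fac = FrequencyAwareConv(16, 32, n_freq_bins=128)
y = fac(x)                                        # (4, 32, 128, 250)
```

Under these assumptions the overhead is only the encoding vector plus the small gating layer, i.e., hundreds of parameters rather than the millions FDY adds; the exact count depends on details the abstract leaves open.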
- N. Shreyas, M. Venkatraman, S. Malini, and S. Chandrakala, “Trends of sound event recognition in audio surveillance: A recent review and study,” in The Cognitive Approach in Cloud Computing and Internet of Things Technologies for Surveillance Tracking Systems, D. Peter, A. H. Alavi, B. Javadi, and S. L. Fernandes, Eds. Academic Press, 2020, ch. 7, pp. 95–106.
- A. Vafeiadis, K. Votis, D. Giakoumis, D. Tzovaras, L. Chen, and R. Hamzaoui, “Audio content analysis for unobtrusive event detection in smart homes,” Engineering Applications of Artificial Intelligence, vol. 89, p. 103226, 2020.
- Z. Mnasri, S. Rovetta, and F. Masulli, “Anomalous sound event detection: A survey of machine learning based methods and applications,” Multimedia Tools and Applications, vol. 81, no. 4, pp. 5537–5586, 2022.
- E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291–1303, 2017.
- K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, and K. Takeda, “Weakly-supervised sound event detection with self-attention,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, Apr. 2020, pp. 66–70.
- J. Ebbers and R. Haeb-Umbach, “Self-trained audio tagging and sound event detection in domestic environments,” in Detection and Classification of Acoustic Scenes and Events (DCASE), 2021, pp. 226–230.
- K. Guirguis, C. Schorn, A. Guntoro, S. Abdulatif, and B. Yang, “SELD-TCN: Sound event localization & detection via temporal convolutional networks,” in European Signal Processing Conference (EUSIPCO), Amsterdam, Netherlands, Dec. 2021, pp. 16–20.
- K. Wakayama and S. Saito, “CNN-transformer with self-attention network for sound event detection,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, Apr. 2022, pp. 806–810.
- H. Nam, S. Kim, B. Ko, and Y. Park, “Frequency dynamic convolution: Frequency-adaptive pattern recognition for sound event detection,” in Interspeech, Incheon, Korea, Sep. 2022, pp. 2763–2767.
- K. Koutini, H. Eghbalzadeh, and G. Widmer, “Receptive-field-regularized CNN variants for acoustic scene classification,” in Detection and Classification of Acoustic Scenes and Events (DCASE), 2019, pp. 1–5.
- A. Rakowski and M. Kosmider, “Frequency-aware CNN for open set acoustic scene classification,” in Detection and Classification of Acoustic Scenes and Events (DCASE), New York, NY, 2019, pp. 25–26.
- N. Aryal and S.-W. Lee, “Frequency-based CNN and attention module for acoustic scene classification,” Applied Acoustics, vol. 210, p. 109411, 2023.
- R. Serizel, N. Turpault, A. Shah, and J. Salamon, “Sound event detection in synthetic domestic environments,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
- A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, Dec. 2017, pp. 1195–1204.
- H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations (ICLR), Vancouver, Canada, Apr. 2018.
- H. Nam, S.-H. Kim, and Y.-H. Park, “Filteraugment: An acoustic environmental data augmentation method,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, Apr. 2022, pp. 4308–4312.
- Ç. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulović, “A framework for the robust evaluation of sound event detection,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, Apr. 2020, pp. 61–65.
- “DCASE 2022 task 4: Task description,” https://dcase.community/challenge2022/task-sound-event-detection-in-domestic-environments.