Device-Robust Acoustic Scene Classification via Impulse Response Augmentation (2305.07499v2)
Abstract: The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different microphone types introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, a model's performance can degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with only a small number of different microphones makes generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with pre-recorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. However, the two methods are complementary: combined, they achieve new state-of-the-art performance on signals recorded by devices unseen during training.
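The core augmentation described in the abstract is a plain convolution of each training waveform with a randomly drawn device impulse response. The sketch below illustrates this under stated assumptions: the function name, the application probability, and the peak renormalization are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of DIR augmentation, assuming the DIRs are available
# as a list of 1-D numpy arrays at the same sample rate as the audio.
# All names and default values here are illustrative.
import numpy as np
from scipy.signal import fftconvolve

def dir_augment(waveform, dirs, prob=0.5, rng=None):
    """Convolve a training clip with a random device impulse response (DIR)
    to simulate recording the same scene with a different microphone."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() > prob:
        return waveform  # keep the clip unchanged with probability 1 - prob
    ir = dirs[rng.integers(len(dirs))]
    out = fftconvolve(waveform, ir, mode="full")[: len(waveform)]
    # Rescale so the augmented clip keeps the original peak amplitude.
    return out * (np.max(np.abs(waveform)) / (np.max(np.abs(out)) + 1e-9))
```

Freq-MixStyle, the baseline the paper compares against and combines with, perturbs device characteristics in the feature domain instead. A minimal sketch follows, assuming the common formulation in which MixStyle is applied along the frequency axis of log-mel spectrograms, i.e., each frequency band is normalized and re-scaled with statistics mixed across the batch; shapes and hyperparameters are illustrative.

```python
import torch

def freq_mixstyle(x, alpha=0.3, p=0.5):
    """x: (batch, channels, freq, time) batch of log-mel spectrograms."""
    if torch.rand(1).item() > p:
        return x  # apply the augmentation with probability p
    # Per-sample, per-frequency statistics over channels and time.
    mu = x.mean(dim=(1, 3), keepdim=True)           # (B, 1, F, 1)
    sigma = x.std(dim=(1, 3), keepdim=True) + 1e-6
    x_norm = (x - mu) / sigma
    # Mix each sample's frequency statistics with a random batch partner.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1 - lam) * sigma[perm]
    return x_norm * sigma_mix + mu_mix
```

Because the DIR convolution acts on the raw waveform and Freq-MixStyle on spectrogram statistics, the two augmentations can simply be applied in sequence, which matches the abstract's finding that they are complementary.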