TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification (2309.08200v4)
Abstract: Recent studies focus on developing efficient systems for acoustic scene classification (ASC) using convolutional neural networks (CNNs), which typically consist of consecutive kernels. This paper highlights the benefits of using separate kernels as a more powerful and efficient design approach in ASC tasks. Inspired by the time-frequency nature of audio signals, we propose TF-SepNet, a CNN architecture that separates the feature processing along the time and frequency dimensions. Features resulting from the separate paths are then merged along the channel dimension and directly forwarded to the classifier. Instead of the conventional two-dimensional (2D) kernel, TF-SepNet incorporates one-dimensional (1D) kernels to reduce the computational cost. Experiments were conducted on the TAU Urban Acoustic Scenes 2022 Mobile development dataset. The results show that TF-SepNet outperforms similar state-of-the-art models that use consecutive kernels. Further investigation reveals that the separate kernels lead to a larger effective receptive field (ERF), which enables TF-SepNet to capture more time-frequency features.
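The cost argument for 1D kernels can be illustrated with a rough count of weights and multiply-accumulate operations. The sketch below compares one conventional 2D convolution with two parallel 1D paths, one along frequency and one along time, whose outputs would be merged by channels. The even channel split, the layer sizes, and the helper names `conv_cost` and `separate_1d_cost` are assumptions made for illustration; they are not taken from the paper and do not reproduce the actual TF-SepNet block.

```python
def conv_cost(c_in, c_out, k_f, k_t, out_f, out_t):
    """Return (parameters, multiply-accumulates) of a bias-free convolution.

    k_f and k_t are the kernel extents along frequency and time;
    out_f and out_t are the output feature-map sizes.
    """
    params = c_in * c_out * k_f * k_t
    macs = params * out_f * out_t
    return params, macs


def separate_1d_cost(channels, k, out_f, out_t):
    """Cost of two parallel 1D paths, each handling half of the channels.

    Assumption: channels are split evenly, the frequency path uses a
    k x 1 kernel and the time path a 1 x k kernel, and the two outputs
    are concatenated along the channel axis (concatenation costs nothing).
    """
    half = channels // 2
    freq_params, freq_macs = conv_cost(half, half, k, 1, out_f, out_t)
    time_params, time_macs = conv_cost(half, half, 1, k, out_f, out_t)
    return freq_params + time_params, freq_macs + time_macs


# Hypothetical layer: 64 channels, kernel size 3, 32 x 32 output map.
channels, k, out_f, out_t = 64, 3, 32, 32

full_params, full_macs = conv_cost(channels, channels, k, k, out_f, out_t)
sep_params, sep_macs = separate_1d_cost(channels, k, out_f, out_t)

print("2D kernel params:", full_params)
print("separate 1D params:", sep_params)
print("MAC reduction factor:", full_macs / sep_macs)
```

Under these assumed sizes the two 1D paths need one sixth of the weights and operations of the single 2D layer. The factor depends entirely on the chosen split and kernel size, so it should be read as an illustration of the trend rather than a figure from the paper.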