Toward end-to-end interpretable convolutional neural networks for waveform signals (2405.01815v1)
Abstract: This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models, offering advances in efficiency and explainability. In benchmark experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel spectrogram features by up to seven percent. It can potentially replace Mel-frequency cepstral coefficients (MFCC) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer on the PhysioNet Heart Sound Database, illustrating its ability to capture intricate patterns in long waveforms. Our contributions offer a portable solution for building efficient and interpretable models for raw waveform data.
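The abstract contrasts a learnable convolutional front end operating directly on raw waveforms with fixed Mel-spectrogram or MFCC features. As an illustration only, and not the paper's actual architecture, the sketch below shows a generic learnable 1D-convolution front end feeding a small classifier in PyTorch; the filter count, kernel width, stride, and log-compression step are assumptions chosen for readability.

```python
# Illustrative sketch: a learnable 1D-conv front end on raw audio, standing in
# for fixed Mel/MFCC features. Hyperparameters here are assumptions, not the
# values used in the paper.
import torch
import torch.nn as nn

class RawWaveformFrontEnd(nn.Module):
    def __init__(self, n_filters: int = 64, kernel_size: int = 401, stride: int = 160):
        super().__init__()
        # A bank of learnable FIR-like filters applied to the waveform; the stride
        # plays the role of the hop length of a conventional STFT front end
        # (assumed 10 ms at 16 kHz).
        self.filters = nn.Conv1d(1, n_filters, kernel_size, stride=stride,
                                 padding=kernel_size // 2, bias=False)
        self.norm = nn.BatchNorm1d(n_filters)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, samples), raw audio roughly in [-1, 1]
        x = self.filters(wave.unsqueeze(1))      # (batch, n_filters, frames)
        x = torch.log1p(torch.abs(x))            # magnitude + log compression
        return self.norm(x)

class WaveformClassifier(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.front_end = RawWaveformFrontEnd()
        self.backbone = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),             # pool over time to a fixed-size vector
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(self.front_end(wave)).squeeze(-1)
        return self.head(feats)

if __name__ == "__main__":
    model = WaveformClassifier(n_classes=4)
    dummy = torch.randn(2, 16000)                # two 1-second clips at an assumed 16 kHz
    print(model(dummy).shape)                    # torch.Size([2, 4])
```

Because the front-end filters are ordinary convolution weights, they can be inspected (e.g., via their frequency responses) after training, which is the kind of interpretability the abstract refers to.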