Toward end-to-end interpretable convolutional neural networks for waveform signals (2405.01815v1)
Abstract: This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models, presenting advancements in efficiency and explainability. In benchmark experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel-spectrogram features by up to seven percent. It can potentially replace Mel-frequency cepstral coefficients (MFCCs) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer on the PhysioNet Heart Sound Database, illustrating its ability to capture intricate patterns in long waveforms. Our contributions offer a portable solution for building efficient and interpretable models for raw waveform data.
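To make the idea of an end-to-end front-end concrete, below is a minimal sketch of how a learnable convolutional layer can consume raw waveforms in place of fixed Mel/MFCC features. This is an illustrative assumption, not the architecture from the paper: the filter count, kernel length, log compression, and the small classifier head are all hypothetical choices.

```python
# Hypothetical sketch: a learnable 1-D convolutional front-end over raw audio,
# standing in for a fixed Mel/MFCC feature extractor. Layer names and sizes
# are illustrative assumptions, not the layers described in the paper.
import torch
import torch.nn as nn


class WaveformFrontEnd(nn.Module):
    """Bank of learnable 1-D filters applied directly to the raw waveform."""

    def __init__(self, n_filters: int = 64, kernel_size: int = 401, stride: int = 160):
        super().__init__()
        # One long strided convolution acts as a learnable filterbank.
        self.filterbank = nn.Conv1d(1, n_filters, kernel_size, stride=stride,
                                    padding=kernel_size // 2, bias=False)
        self.norm = nn.BatchNorm1d(n_filters)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> (batch, 1, samples)
        x = self.filterbank(waveform.unsqueeze(1))
        # Log compression loosely mirrors the log step in Mel/MFCC pipelines.
        x = torch.log1p(torch.abs(x))
        return self.norm(x)  # (batch, n_filters, frames)


class EmotionClassifier(nn.Module):
    """Minimal downstream CNN consuming the front-end's output."""

    def __init__(self, n_classes: int = 8):
        super().__init__()
        self.front_end = WaveformFrontEnd()
        self.backbone = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        x = self.backbone(self.front_end(waveform))
        return self.head(x.squeeze(-1))


# Usage: a batch of four one-second clips at 16 kHz.
model = EmotionClassifier(n_classes=8)
logits = model(torch.randn(4, 16000))  # -> shape (4, 8)
```

Because the filterbank weights are ordinary convolution parameters, they are trained jointly with the classifier, which is what allows the learned front-end to be inspected directly instead of relying on fixed MFCC coefficients.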