emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition (2403.14083v1)
Abstract: Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has improved significantly. However, designing an optimal DL architecture requires specialised knowledge and experimental assessment. Fortunately, Neural Architecture Search (NAS) offers a way to determine the best DL model automatically, and Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering such models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. Prior literature supports coupling a CNN with an LSTM to improve performance. While DARTS has previously been used to select CNN and LSTM operations independently, our technique introduces a novel mechanism for selecting CNN and SeqNN operations jointly using DARTS. Unlike earlier work, we impose no constraints on the layer order of the CNN; instead, we let DARTS choose the best layer order inside the DARTS cell. Evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets, we demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM.
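To make the joint-search idea concrete, below is a minimal PyTorch sketch of a DARTS-style mixed operation whose candidate set spans both convolutional and sequential (LSTM, RNN) operations, so a single cell edge can select either kind. The class names, the specific candidate set, and the (batch, channels, time) tensor layout are illustrative assumptions, not the paper's exact implementation; in full DARTS the alpha parameters are additionally optimised in a bilevel loop on validation data while the operation weights train on the training split.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeqOp(nn.Module):
    """Adapter so recurrent layers accept (batch, channels, time) tensors."""

    def __init__(self, rnn: nn.Module):
        super().__init__()
        self.rnn = rnn

    def forward(self, x):
        # PyTorch RNNs expect (batch, time, features); convert and restore.
        out, _ = self.rnn(x.transpose(1, 2))
        return out.transpose(1, 2)


class MixedOp(nn.Module):
    """DARTS-style mixed operation over a joint CNN + SeqNN candidate set.

    Every candidate maps (batch, channels, time) to the same shape, so
    convolutional and sequential operations compete on an equal footing
    within a single cell edge. The candidate set here is hypothetical.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),  # CNN candidate
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),  # CNN candidate
            nn.Identity(),                                            # skip connection
            SeqOp(nn.LSTM(channels, channels, batch_first=True)),     # SeqNN candidate
            SeqOp(nn.RNN(channels, channels, batch_first=True)),      # SeqNN candidate
        ])
        # Architecture parameters (alpha): one logit per candidate operation,
        # learned by gradient descent alongside the network weights.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        # DARTS continuous relaxation: softmax-weighted sum of all candidates.
        weights = F.softmax(self.alpha, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))


if __name__ == "__main__":
    x = torch.randn(4, 32, 100)   # (batch, feature channels, time frames)
    edge = MixedOp(channels=32)
    print(edge(x).shape)          # torch.Size([4, 32, 100])
```

Once the search converges, the discrete architecture is recovered by keeping, on each edge, the candidate with the largest alpha, which in this sketch could be either a convolution or a recurrent layer, mirroring the unconstrained layer ordering described above.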