Empirical Interpretation of the Relationship Between Speech Acoustic Context and Emotion Recognition (2306.17500v1)
Abstract: Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech. Variations in consonant-vowel (CV) phonemic boundaries can enrich the acoustic context with linguistic cues, which impacts SER. In practice, speech emotions are treated as single labels assigned to an acoustic segment of a given duration. However, phone boundaries within speech are not discrete events; the perceived emotional state should therefore also be distributed over potentially continuous time windows. This research explores the implications of acoustic context and phone boundaries for local markers of SER using an attention-based approach. The benefits of a distributed approach to speech emotion understanding are supported by the results of cross-corpora analysis experiments. In further experiments, phones and words are mapped to the attention vectors, along with the fundamental frequency, to observe their overlapping distributions and thereby the relationship between acoustic context and emotion. This work aims to bridge psycholinguistic theory with computational modelling for SER.
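To make the phone-to-attention mapping concrete, here is a minimal sketch of how frame-level attention weights and an F0 contour could be aggregated over phone intervals. It assumes an attention-based SER model that yields one weight per acoustic frame, phone-level time alignments from a forced aligner, and a frame-level fundamental frequency track; all function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def aggregate_attention_by_phone(attention, phone_segments, frame_shift=0.01):
    """Sum frame-level attention mass inside each phone interval.

    attention      : 1-D array of attention weights, one per acoustic frame.
    phone_segments : list of (label, start_sec, end_sec) from a forced aligner.
    frame_shift    : hop size of the front-end features in seconds.
    """
    results = []
    for label, start, end in phone_segments:
        lo = int(round(start / frame_shift))
        hi = int(round(end / frame_shift))
        results.append((label, float(attention[lo:hi].sum())))
    return results

def mean_f0_by_phone(f0, phone_segments, frame_shift=0.01):
    """Mean voiced F0 (Hz) inside each phone interval; unvoiced frames are 0."""
    results = []
    for label, start, end in phone_segments:
        lo = int(round(start / frame_shift))
        hi = int(round(end / frame_shift))
        voiced = f0[lo:hi][f0[lo:hi] > 0]
        results.append((label, float(voiced.mean()) if voiced.size else 0.0))
    return results

if __name__ == "__main__":
    # Toy example: 100 frames (1 s of audio), three phone segments.
    rng = np.random.default_rng(0)
    attention = rng.dirichlet(np.ones(100))                     # normalised attention over frames
    f0 = np.where(rng.random(100) > 0.3, 180 + 40 * rng.random(100), 0.0)
    segments = [("ih", 0.00, 0.25), ("m", 0.25, 0.55), ("ow", 0.55, 1.00)]
    print(aggregate_attention_by_phone(attention, segments))
    print(mean_f0_by_phone(f0, segments))
```

Comparing per-phone attention mass with per-phone F0 in this way is one plausible route to the overlapping distributions discussed in the abstract; the paper's actual alignment and analysis pipeline may differ.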