Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations (2309.04849v2)

Published 9 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only a stream of speech signals to perform unimodal SER, thus reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained prosodic and linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, achieving state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
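The dual-teacher distillation described in the abstract can be sketched as a combined training objective: a hard-label cross-entropy term, a logit-level KD term (temperature-scaled KL divergence to each teacher), and an embedding-level matching term. The specific losses, temperature, and weighting below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits_list, labels,
                 student_emb, teacher_emb_list,
                 T=2.0, alpha=0.5, beta=0.1):
    """Hypothetical EmoDistill-style objective (sketch):
    CE(labels) + alpha * mean_t KL(teacher_t || student) * T^2
               + beta  * mean_t MSE(student_emb, teacher_emb_t)."""
    n = len(labels)
    # Hard-label cross-entropy on the student's predictions
    p_s = softmax(student_logits)
    ce = -np.mean(np.log(p_s[np.arange(n), labels] + 1e-12))

    # Logit-level KD: KL from each teacher's softened distribution
    log_q = np.log(softmax(student_logits, T=T) + 1e-12)
    kd = 0.0
    for t_logits in teacher_logits_list:
        p_t = softmax(t_logits, T=T)
        kd += np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - log_q),
                             axis=-1)) * T * T
    kd /= len(teacher_logits_list)

    # Embedding-level distillation: match each teacher's representation
    emb = float(np.mean([np.mean((student_emb - t_emb) ** 2)
                         for t_emb in teacher_emb_list]))
    return ce + alpha * kd + beta * emb
```

At inference time only the student branch would run on the raw speech stream; the teachers (and this loss) exist solely during training, which is what removes the run-time transcription and prosodic feature extraction steps.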
