Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition

Published 16 Mar 2021 in cs.CV, cs.LG, cs.SD, and eess.AS | (2103.09154v2)

Abstract: Emotional expressions are the behaviors that communicate our emotional state or attitude to others. They are expressed through verbal and non-verbal communication. Complex human behavior can be understood by studying physical features from multiple modalities; mainly facial, vocal and physical gestures. Recently, spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis. In this paper, we propose a new deep learning-based approach for audio-visual emotion recognition. Our approach leverages recent advances in deep learning like knowledge distillation and high-performing deep architectures. The deep feature representations of the audio and visual modalities are fused based on a model-level fusion strategy. A recurrent neural network is then used to capture the temporal dynamics. Our proposed approach substantially outperforms state-of-the-art approaches in predicting valence on the RECOLA dataset. Moreover, our proposed visual facial expression feature extraction network outperforms state-of-the-art results on the AffectNet and Google Facial Expression Comparison datasets.

Summary

The paper introduces a novel deep learning framework using knowledge distillation and advanced architectures to fuse audio and visual modalities for improved emotion recognition.
The framework combines a visual CNN trained with self-distillation, a modified VGGish audio network, and an LSTM-based model-level fusion strategy that captures spatio-temporal dynamics.
On the RECOLA dataset, the framework reaches a CCC of 0.740 for valence prediction; the visual network separately achieves state-of-the-art accuracy on AffectNet and Google FEC. The results have implications for HCI and mental health applications.

The paper “Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition” applies recent deep learning techniques to audio-visual emotion recognition, an active area of affective computing. The authors introduce a framework that uses knowledge distillation and high-performing deep architectures to extract audio and visual features, fuses the two modalities, and captures temporal dynamics with a recurrent neural network.

Methodology and Key Contributions

The proposed approach divides the emotion recognition task into three core components:

Visual Facial Expression Embedding Network: A deep convolutional neural network (CNN) trained with a self-distillation strategy on the AffectNet and Google Facial Expression Comparison datasets. Self-distillation acts as a regularizer and improves the robustness of the resulting embedding.
Audio Embedding Network for Emotion Recognition: Based on the VGGish architecture, this component is modified and fine-tuned to extract emotion-relevant features from audio signals, with competitive results on the test data.
Model-Level Fusion of Audio-Visual Features: The features extracted from the audio and visual networks are combined with a model-level fusion strategy, and LSTM networks then process the fused sequence. This lets the model learn the spatio-temporal dynamics of emotion in video sequences.
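
The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `TinyLSTMCell`, the function `fuse_and_predict`, the random weight initialisation, the omission of bias terms, and the tanh-bounded output are all assumptions made for illustration. It only shows the general pattern of concatenating per-frame audio and visual embeddings and passing the fused sequence through a recurrent cell.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinyLSTMCell:
    """Hypothetical single LSTM cell with random weights and no biases."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        n = input_size + hidden_size

        def init():
            return [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                    for _ in range(hidden_size)]

        # One weight matrix per gate: input, forget, output, candidate.
        self.W_i, self.W_f, self.W_o, self.W_g = init(), init(), init(), init()
        self.hidden_size = hidden_size

    def step(self, x, h, c):
        z = list(x) + list(h)  # current input joined with previous hidden state
        i = [sigmoid(a) for a in matvec(self.W_i, z)]
        f = [sigmoid(a) for a in matvec(self.W_f, z)]
        o = [sigmoid(a) for a in matvec(self.W_o, z)]
        g = [math.tanh(a) for a in matvec(self.W_g, z)]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def fuse_and_predict(visual_seq, audio_seq, cell, w_out):
    """Concatenate per-frame embeddings, run the LSTM over time, and
    return one score in [-1, 1] per frame (assumed output range)."""
    if len(visual_seq) != len(audio_seq):
        raise ValueError("audio and visual sequences must be time-aligned")
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    scores = []
    for v, a in zip(visual_seq, audio_seq):
        fused = list(v) + list(a)  # fusion by concatenation of the two embeddings
        h, c = cell.step(fused, h, c)
        scores.append(math.tanh(sum(w * hj for w, hj in zip(w_out, h))))
    return scores


if __name__ == "__main__":
    # Toy data: 5 frames, 4-dim visual and 3-dim audio embeddings.
    visual = [[0.1 * t, 0.2, -0.3, 0.4] for t in range(5)]
    audio = [[0.5, -0.1 * t, 0.2] for t in range(5)]
    cell = TinyLSTMCell(input_size=7, hidden_size=6)
    print(fuse_and_predict(visual, audio, cell, w_out=[0.5] * 6))
```

In practice the embeddings would come from the trained visual and audio networks and the recurrent weights would be learned; the sketch uses toy inputs and untrained weights purely to show the data flow.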

Results and Discussion

Experimental validation on the RECOLA dataset shows a substantial improvement over existing state-of-the-art methods in predicting valence, with comparable results for arousal. Specifically, the framework reaches a Concordance Correlation Coefficient (CCC) of 0.740 for valence prediction on the test set, compared with a previous benchmark of 0.612 from Tzirakis et al. (2017).
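
For readers unfamiliar with the metric, the following sketch computes the CCC using its standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function name `concordance_cc` is chosen here for illustration, and the summary does not state how the authors aggregate the metric across recordings, so this is only the basic per-sequence form.

```python
def concordance_cc(predictions, labels):
    """Concordance Correlation Coefficient between two equal-length sequences.

    Unlike Pearson correlation, CCC also penalises differences in mean
    and scale between predictions and labels.
    """
    n = len(predictions)
    if n == 0 or n != len(labels):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_p = sum(predictions) / n
    mean_l = sum(labels) / n

    # Population (biased) variance and covariance.
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l)
              for p, l in zip(predictions, labels)) / n

    denom = var_p + var_l + (mean_p - mean_l) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


if __name__ == "__main__":
    # Perfectly correlated but shifted by 1: Pearson would be 1.0,
    # while CCC drops to 4/7 because of the offset in the means.
    print(concordance_cc([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

The shifted example shows why CCC is a stricter measure than plain correlation for continuous valence and arousal traces: predictions must match the labels in level as well as in trend.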

Moreover, the visual facial expression embedding network achieved state-of-the-art accuracies of 61.6% on AffectNet and 86.5% on Google FEC, supporting the self-distillation and dual-dataset training strategy. The audio embedding network based on the modified VGGish also performed strongly, with a CCC of 0.70 when predicting arousal.
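
This summary does not spell out the self-distillation objective. As a rough guide, a generic knowledge-distillation loss mixes the usual cross-entropy on the hard label with a KL term that pulls the student's softened predictions towards a teacher's; in self-distillation the teacher is typically a previously trained copy of the same architecture. The sketch below shows that generic form only. The function names, the weighting `alpha`, and the `temperature` value are assumptions, not the paper's settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Generic distillation loss for a single example (assumed form).

    (1 - alpha) * cross-entropy with the hard label
    + alpha * T^2 * KL(teacher || student) on temperature-softened outputs.
    """
    # Hard-label cross-entropy at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-target term: KL divergence from teacher to student.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(q_teacher, q_student) if t > 0)

    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teacher = [1.5, 1.0, -0.5]
    print(distillation_loss(student, teacher, label=0))
```

The soft-target term is what provides the regularization effect: it discourages the student from becoming overconfident on the hard labels alone.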

Implications and Future Directions

The framework’s results point to real-world applications in human-computer interaction, where systems could interpret and adapt to human emotions more accurately. Better affect recognition could benefit areas such as assistive technology, mental health analytics, and personalized user experiences.

Several directions remain open for further research. Training on larger unlabeled datasets could better test the impact of self-distillation when combined with unsupervised learning. In addition, joint audio-visual features aligned through more sophisticated fusion strategies may improve temporal modeling and capture some of the subtler behaviors seen in in-the-wild emotion recognition.

The paper provides a strong basis for applying deep learning to multimodal emotion recognition and for building affective computing models that hold up under real-world conditions.