Emergent Mind

Abstract

In this study, we focus on continuous emotion recognition using body motion and speech signals to estimate Activation, Valence, and Dominance (AVD) attributes. Semi-End-To-End network architecture is proposed where both extracted features and raw signals are fed, and this network is trained using multi-task learning (MTL) rather than the state-of-the-art single task learning (STL). Furthermore, correlation losses, Concordance Correlation Coefficient (CCC) and Pearson Correlation Coefficient (PCC), are used as an optimization objective during the training. Experiments are conducted on CreativeIT and RECOLA database, and evaluations are performed using the CCC metric. To highlight the effect of MTL, correlation losses and multi-modality, we respectively compare the performance of MTL against STL, CCC loss against root mean square error (MSE) loss and, PCC loss, multi-modality against single modality. We observe significant performance improvements with MTL training over STL, especially for estimation of the valence. Furthermore, the CCC loss achieves more than 7% CCC improvements on CreativeIT, and 13% improvements on RECOLA against MSE loss.

We're not able to analyze this paper right now due to high demand.

Please check back later (sorry!).

Generate a summary of this paper on our Pro plan:

We ran into a problem analyzing this paper.

Newsletter

Get summaries of trending comp sci papers delivered straight to your inbox:

Unsubscribe anytime.