
Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (2104.03502v1)

Published 8 Apr 2021 in cs.SD, cs.LG, and eess.AS

Abstract: Emotion recognition datasets are relatively small, making the use of the more sophisticated deep learning approaches challenging. In this work, we propose a transfer learning method for speech emotion recognition where features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks. We propose to combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model. Further, we compare performance using two different wav2vec 2.0 models, with and without finetuning for speech recognition. We evaluate our proposed approaches on two standard emotion databases IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.

Citations (326)

Summary

  • The paper introduces a novel approach that extracts and refines wav2vec 2.0 features for enhanced emotion recognition from speech.
  • It compares a pre-trained-only model (wav2vec2-PT) against an ASR-finetuned variant (wav2vec2-FT), showing that the model without ASR finetuning preserves emotional cues more effectively.
  • The research reveals that simple downstream neural networks, especially when combined with prosodic features, can efficiently utilize self-supervised embeddings for robust emotion detection.

Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings

The paper "Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings" explores an innovative approach to speech emotion recognition leveraging transfer learning and the capabilities of the wav2vec 2.0 model. In the field of machine learning, emotion recognition from speech presents a unique challenge due to the typically small size of available labeled datasets, which hinders the effectiveness of sophisticated deep learning models.

Methodology and Models

This research employs wav2vec 2.0, a self-supervised learning framework originally designed for automatic speech recognition (ASR), as a feature extractor for emotion recognition from audio. The authors introduce a methodology that extracts and combines features from various layers of the wav2vec 2.0 model and uses them to train downstream models.
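As a rough illustration of this feature-extraction step, the sketch below pulls per-layer hidden states from a publicly available pre-trained wav2vec 2.0 checkpoint via the Hugging Face transformers library. The checkpoint name and the extraction details are assumptions for illustration, not necessarily the authors' exact setup.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Hypothetical checkpoint; the paper's wav2vec2-PT/FT models may differ.
model_name = "facebook/wav2vec2-base"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
model.eval()

def extract_layer_features(waveform, sample_rate=16000):
    """Return hidden states from every transformer layer for one utterance."""
    inputs = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Tuple of (num_layers + 1) tensors, each of shape (batch, frames, hidden_dim);
    # index 0 is the output of the convolutional encoder (after projection),
    # the remaining entries are the transformer layers.
    return outputs.hidden_states
```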

The researchers explore two configurations of wav2vec 2.0: one pre-trained only on unlabeled raw audio (wav2vec2-PT), and another additionally fine-tuned for ASR on a subset of the LibriSpeech dataset (wav2vec2-FT). Both models are used to extract features, which are then evaluated with neural networks of differing architectural complexity. A notable aspect of this paper is the fusion of outputs from multiple transformer layers within wav2vec 2.0, using trainable weights learned jointly with the downstream model, an approach that demonstrated improved performance.
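The trainable layer-fusion idea can be sketched roughly as follows: one softmax-normalized weight per layer, learned jointly with a small downstream classifier. The module names, layer sizes, and mean pooling over time are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class WeightedLayerPooling(nn.Module):
    """Combine hidden states from all layers with trainable softmax weights."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: tuple/list of (batch, frames, dim) tensors, one per layer.
        stacked = torch.stack(hidden_states, dim=0)           # (layers, batch, frames, dim)
        weights = torch.softmax(self.layer_weights, dim=0)    # (layers,)
        return (weights[:, None, None, None] * stacked).sum(dim=0)

class EmotionClassifier(nn.Module):
    """Minimal downstream model: layer fusion, mean pooling over time, dense layers."""
    def __init__(self, num_layers: int, hidden_dim: int, num_emotions: int = 4):
        super().__init__()
        self.pooling = WeightedLayerPooling(num_layers)
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, hidden_states):
        fused = self.pooling(hidden_states)   # (batch, frames, dim)
        pooled = fused.mean(dim=1)            # average over time frames
        return self.net(pooled)
```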

Datasets and Evaluation

The evaluation was conducted on two benchmark datasets, IEMOCAP and RAVDESS, both well established in emotion recognition research and therefore suitable for validating the approach. The authors assessed performance using average recall across emotion classes, a measure that reflects how well the model identifies each emotion regardless of class imbalance.
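Average recall over classes corresponds to macro-averaged recall and can be computed as in the short sketch below; the label arrays are placeholders for illustration only.

```python
from sklearn.metrics import recall_score

# Hypothetical predictions over four emotion classes (0..3).
y_true = [0, 0, 1, 2, 3, 3, 1, 2]
y_pred = [0, 1, 1, 2, 3, 0, 1, 2]

# "macro" computes recall per class and averages with equal class weights,
# i.e., an unweighted average recall across emotion classes.
avg_recall = recall_score(y_true, y_pred, average="macro")
print(f"Average recall: {avg_recall:.3f}")
```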

Key Results

The analysis reveals that a weighted combination of outputs from multiple model layers significantly improves performance over any single-layer output. Notably, the wav2vec2-PT model generally outperformed the wav2vec2-FT model in emotion recognition tasks. This result points to a potential loss of useful emotional information during ASR-specific finetuning.

Furthermore, the research demonstrates that simple downstream models built from dense layers perform competitively with more complex recurrent networks, suggesting that wav2vec 2.0 may already capture temporal dependencies sufficiently. Additionally, combining wav2vec 2.0 embeddings with prosodic features such as eGeMAPS yielded incremental performance gains, demonstrating the value of fusing complementary acoustic feature sets.
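Fusing the two feature streams can be as simple as concatenating utterance-level eGeMAPS functionals with a time-pooled wav2vec 2.0 vector before the classifier. The sketch below uses the openSMILE Python bindings and is only one plausible realization of that idea, not necessarily the paper's exact fusion scheme.

```python
import numpy as np
import opensmile

# eGeMAPS functionals: one fixed-length vector (88 features) per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def fused_features(wav_path, wav2vec_embedding):
    """Concatenate eGeMAPS functionals with a time-pooled wav2vec 2.0 vector."""
    egemaps = smile.process_file(wav_path).values.squeeze()  # shape (88,)
    return np.concatenate([wav2vec_embedding, egemaps])
```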

Implications and Future Directions

The findings of this work have substantial implications for the field of speech emotion recognition. By harnessing the capabilities of self-supervised models like wav2vec 2.0, researchers can potentially bypass the scarcity of labeled data, traditionally a major bottleneck. The success of leveraging weighted layer combinations opens up new avenues for feature extraction strategies in other fields as well.

In future research, exploring additional finetuning strategies on wav2vec 2.0 could be beneficial to better preserve emotion-relevant information. Additionally, investigating the use of larger, more diverse emotion datasets could offer further insights into the generalizability of this approach. The incorporation of other self-supervised models into the pipeline, alongside an enhanced understanding of their internal layer dynamics, presents a promising direction for the evolution of speech emotion recognition capabilities.
