A vector quantized masked autoencoder for audiovisual speech emotion recognition (2305.03568v3)
Abstract: An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose VQ-MAE-AV, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders (VQ-VAEs) that compress raw audio and visual speech data into discrete tokens. These audiovisual speech tokens are used to train a multimodal masked autoencoder, an encoder-decoder architecture with attention mechanisms designed to extract both local (i.e., frame-level) and global (i.e., sequence-level) representations of audiovisual speech. During self-supervised pre-training, VQ-MAE-AV is trained on a large-scale unlabeled dataset of audiovisual speech to reconstruct randomly masked audiovisual speech tokens, in combination with a contrastive learning strategy. Through this pre-training, the encoder learns to extract a representation of audiovisual speech that can subsequently be leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets, in both controlled and in-the-wild conditions.
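To make the training recipe concrete, below is a minimal PyTorch sketch of the pre-training step described above. It is an illustration under stated assumptions, not the authors' implementation: the module sizes, the 80% mask ratio, mean pooling for the global (sequence-level) representation, and the SimCLR-style InfoNCE term between two independently masked views of the same sequence are all placeholder choices, and the discrete token sequences are assumed to come from pre-trained audio and visual VQ-VAEs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQMAEAVSketch(nn.Module):
    """Toy multimodal masked autoencoder over discrete audiovisual speech tokens."""

    def __init__(self, vocab_size=512, dim=256, max_tokens=1024, mask_ratio=0.8):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Embeddings for the discrete tokens produced upstream by the
        # (pre-trained, frozen) audio and visual VQ-VAEs.
        self.audio_emb = nn.Embedding(vocab_size, dim)
        self.visual_emb = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        dec = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)  # predicts discrete token indices

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens, visual_tokens: (batch, seq) integer token indices.
        x = torch.cat([self.audio_emb(audio_tokens),
                       self.visual_emb(visual_tokens)], dim=1)
        b, n, d = x.shape
        x = x + self.pos[:, :n]
        # Keep a random subset of tokens visible; the rest are masked out.
        num_keep = int(n * (1 - self.mask_ratio))
        perm = torch.rand(b, n, device=x.device).argsort(dim=1)
        keep, masked = perm[:, :num_keep], perm[:, num_keep:]
        visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, d))
        latent = self.encoder(visible)       # local (token-level) representation
        global_rep = latent.mean(dim=1)      # global (sequence-level) representation
        # Decoder input: encoded visible tokens scattered back into place,
        # with a shared learnable mask token at every masked position.
        full = self.mask_token.expand(b, n, d).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, d), latent)
        logits = self.head(self.decoder(full + self.pos[:, :n]))
        return logits, global_rep, masked


def info_nce(z1, z2, temperature=0.1):
    # SimCLR-style InfoNCE: z1[i] and z2[i] come from the same sequence
    # (two independent maskings); other batch items act as negatives.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.size(0), device=z1.device))


def pretrain_step(model, audio_tokens, visual_tokens):
    targets = torch.cat([audio_tokens, visual_tokens], dim=1)
    logits, z1, masked = model(audio_tokens, visual_tokens)  # masked view 1
    _, z2, _ = model(audio_tokens, visual_tokens)            # masked view 2
    # Cross-entropy reconstruction of the masked discrete tokens only.
    masked_logits = torch.gather(
        logits, 1, masked.unsqueeze(-1).expand(-1, -1, logits.size(-1)))
    recon = F.cross_entropy(masked_logits.transpose(1, 2),
                            torch.gather(targets, 1, masked))
    return recon + info_nce(z1, z2)


# Smoke test on random token sequences standing in for VQ-VAE outputs.
model = VQMAEAVSketch()
audio = torch.randint(0, 512, (2, 64))
visual = torch.randint(0, 512, (2, 64))
pretrain_step(model, audio, visual).backward()
```

For the supervised fine-tuning stage described above, the decoder and prediction head would be discarded and a small classifier trained on the encoder's global representation (here, `global_rep`).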