Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition (2403.19554v1)
Abstract: In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, the modalities may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations and degraded system performance. To address this issue, we propose Dynamic Cross-Attention (DCA), which dynamically selects between cross-attended and unattended features on the fly, depending on whether the modalities exhibit a strong or weak complementary relationship with each other. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose the cross-attended features only when the modalities exhibit a strong complementary relationship, and the unattended features otherwise. We evaluate the proposed approach on the challenging RECOLA and Aff-Wild2 datasets, compare it against other variants of cross-attention, and show that the proposed model consistently improves performance on both datasets.
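To make the gating idea concrete, here is a minimal PyTorch sketch of a layer that blends cross-attended and unattended features based on a learned gate. The module name, the sigmoid-gated convex combination, and all dimensions are illustrative assumptions rather than the paper's exact design.

```python
# Sketch of the Dynamic Cross-Attention (DCA) gating idea from the abstract.
# Assumption: the gate is a small MLP with a sigmoid output that weights the
# cross-attended features against the unattended ones; the actual paper may
# use a different gating formulation.
import torch
import torch.nn as nn


class DynamicCrossAttentionGate(nn.Module):
    """Chooses between cross-attended and unattended features per time step."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Simple gating layer: scores how much the cross-attended features help.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, unattended: torch.Tensor, attended: torch.Tensor) -> torch.Tensor:
        # unattended, attended: (batch, time, feat_dim)
        # Gate near 1 -> strong complementarity, keep cross-attended features;
        # gate near 0 -> weak complementarity, fall back to unattended features.
        g = self.gate(torch.cat([unattended, attended], dim=-1))  # (batch, time, 1)
        return g * attended + (1.0 - g) * unattended


if __name__ == "__main__":
    # Toy usage: audio features gated between their attended and unattended versions.
    gate = DynamicCrossAttentionGate(feat_dim=128)
    a_unattended = torch.randn(4, 50, 128)  # e.g., raw audio features
    a_attended = torch.randn(4, 50, 128)    # e.g., audio cross-attended with video
    fused = gate(a_unattended, a_attended)
    print(fused.shape)  # torch.Size([4, 50, 128])
```

In this sketch the soft gate lets the model interpolate between the two feature sets, which is one plausible way to "select" features differentiably; a hard selection (e.g., thresholding the gate) would be an alternative design choice.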