Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition (2403.19554v1)

Published 28 Mar 2024 in cs.CV

Abstract: In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations and degrading system performance. To address this issue, we propose Dynamic Cross-Attention (DCA), which dynamically selects cross-attended or unattended features on the fly according to whether the modalities exhibit a strong or weak complementary relationship with each other. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose the cross-attended features when the modalities exhibit a strong complementary relationship, and the unattended features otherwise. We evaluate the performance of the proposed approach on the challenging RECOLA and Aff-Wild2 datasets. We also compare the proposed approach with other variants of cross-attention and show that the proposed model consistently improves performance on both datasets.
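The abstract describes a gating layer that decides whether to keep the cross-attended audio-visual features or fall back to the unattended ones. Below is a minimal PyTorch sketch of that idea, not the paper's actual implementation: the module name `DynamicCrossAttention`, the feature dimension, the number of attention heads, and the soft sigmoid gating are all assumptions made for illustration; the paper may use a different gating formulation or a harder selection.

```python
import torch
import torch.nn as nn


class DynamicCrossAttention(nn.Module):
    """Illustrative sketch of dynamic selection between cross-attended and
    unattended audio/visual features via a gating layer (assumed design)."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Cross-attention in both directions: audio attends to video and vice versa.
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gating layers score the contribution of cross-attention per time step;
        # a gate near 1 keeps the cross-attended features, near 0 reverts to the
        # unattended features (soft selection; an assumption, not the paper's exact rule).
        self.audio_gate = nn.Linear(2 * dim, 1)
        self.video_gate = nn.Linear(2 * dim, 1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio, video: (batch, time, dim)
        audio_att, _ = self.v2a_attn(query=audio, key=video, value=video)
        video_att, _ = self.a2v_attn(query=video, key=audio, value=audio)

        # Gate computed from both the unattended and cross-attended features.
        g_a = torch.sigmoid(self.audio_gate(torch.cat([audio, audio_att], dim=-1)))
        g_v = torch.sigmoid(self.video_gate(torch.cat([video, video_att], dim=-1)))

        # Weak complementarity (gate near 0) keeps the original, unattended features.
        audio_out = g_a * audio_att + (1.0 - g_a) * audio
        video_out = g_v * video_att + (1.0 - g_v) * video
        return audio_out, video_out


if __name__ == "__main__":
    # Toy usage: 2 clips, 50 time steps, 128-dim audio and visual features.
    dca = DynamicCrossAttention(dim=128)
    a = torch.randn(2, 50, 128)
    v = torch.randn(2, 50, 128)
    a_out, v_out = dca(a, v)
    print(a_out.shape, v_out.shape)  # torch.Size([2, 50, 128]) torch.Size([2, 50, 128])
```

The soft gate above interpolates between the two feature sets; a thresholded or stochastic gate would give the harder "select one or the other" behavior the abstract suggests, at the cost of a less smooth training signal.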
