Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition (2404.10904v2)
Abstract: Human communication is multi-modal: face-to-face interaction, for example, involves auditory signals (speech) and visual signals (face movements and hand gestures). It is therefore essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. Moreover, given the ever-growing quantities of video data that capture human facial expressions, such systems should exploit raw unlabeled videos without requiring expensive annotations. In this work, we therefore employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objectives: first, a multi-modal contrastive loss that pulls the different modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of the input data in the representation space; and third, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-modal multi-task self-supervised learning method on three facial expression recognition benchmarks, examining how different combinations of self-supervised tasks affect performance on the downstream facial expression recognition task. Our model, ConCluGen, outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Overall, our results show that multi-modal self-supervision offers large performance gains on challenging tasks such as facial expression recognition while reducing the amount of manual annotation required. We release our pre-trained models as well as our source code publicly.
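To make the three objectives concrete, here is a minimal PyTorch sketch of a contrastive, a clustering, and a reconstruction loss of the kind the abstract describes. The function names, tensor shapes, temperatures, and loss weights are illustrative assumptions on our part, not the paper's actual ConCluGen implementation; consult the authors' released code for the real architecture and hyper-parameters.

```python
# Illustrative sketch only; shapes, names, and weights are assumptions,
# not the ConCluGen reference implementation.
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(z_video, z_audio, temperature=0.07):
    """InfoNCE-style loss: embeddings of the two modalities of the same
    clip are pulled together; other clips in the batch act as negatives."""
    z_v = F.normalize(z_video, dim=1)           # (B, D)
    z_a = F.normalize(z_audio, dim=1)           # (B, D)
    logits = z_v @ z_a.t() / temperature        # (B, B) pairwise similarities
    targets = torch.arange(z_v.size(0), device=z_v.device)
    # Symmetric over both retrieval directions (video->audio, audio->video).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def clustering_loss(z, centroids, temperature=0.1):
    """One common stand-in for a clustering objective: softly assign each
    embedding to shared learnable centroids and encourage confident
    (low-entropy) assignments, preserving semantic structure."""
    sims = F.normalize(z, dim=1) @ F.normalize(centroids, dim=1).t()  # (B, K)
    p = F.softmax(sims / temperature, dim=1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=1).mean()

def reconstruction_loss(decoder, z, x):
    """Generative objective: reconstruct the raw input from its embedding."""
    return F.mse_loss(decoder(z), x)

# Multi-task objective; lambda weights here are placeholders.
# loss = (multimodal_contrastive_loss(z_v, z_a)
#         + lambda_clu * clustering_loss(z_v, centroids)
#         + lambda_gen * reconstruction_loss(decoder, z_v, x_v))
```

In this kind of multi-task setup, the contrastive term aligns modalities, the clustering term groups semantically similar clips, and the reconstruction term keeps the representation informative about the raw input; the relative weighting of the three terms is a tunable design choice.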
Authors: Marah Halawa, Florian Blume, Pia Bideau, Martin Maier, Rasha Abdel Rahman, Olaf Hellwich