Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning (2406.11161v2)

Published 17 Jun 2024 in cs.AI and cs.MM

Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal LLMs (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023-SEMI challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.

Summary

The paper introduces Emotion-LLaMA, a novel multimodal large language model that achieves state-of-the-art performance in emotion recognition and reasoning.
The authors created the MERR dataset, with 28,618 coarse-grained and 4,487 fine-grained annotated samples, to support learning across diverse emotion scenarios.
Emotion-LLaMA integrates audio, visual, and textual inputs through emotion-specific encoders and reports an F1 score of 0.9036 on the MER2023-SEMI challenge and a zero-shot UAR of 45.59 on DFEW.

The paper "Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning" (17 Jun 2024) introduces a multimodal LLM (MLLM) designed for emotion recognition and reasoning. Its key contributions and findings are:

MERR Dataset: The authors created a new Multimodal Emotion Recognition and Reasoning (MERR) dataset. This dataset contains 28,618 coarse-grained and 4,487 fine-grained annotated samples, encompassing a broad spectrum of emotional categories. The MERR dataset addresses the limitations of existing multimodal emotion instruction datasets and facilitates learning across diverse scenarios.
Emotion-LLaMA Model: The paper details the development of Emotion-LLaMA, an MLLM integrating audio, visual, and textual inputs through specialized emotion encoders. The model uses HuBERT for processing audio data and employs multiview visual encoders, including MAE, VideoMAE, and EVA, to capture detailed facial information. Instruction tuning is used to refine emotional recognition and reasoning capabilities.
Performance Benchmarking: Emotion-LLaMA was evaluated extensively, demonstrating superior performance compared to other MLLMs across multiple datasets. Key performance metrics include:
- Clue Overlap score of 7.83 on the EMER dataset
- Label Overlap score of 6.25 on the EMER dataset
- F1 score of 0.9036 on the MER2023-SEMI challenge
- Unweighted Average Recall (UAR) of 45.59 in zero-shot evaluations on the DFEW dataset
- Weighted Average Recall (WAR) of 59.37 in zero-shot evaluations on the DFEW dataset
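The two DFEW metrics above differ only in how per-class recall is averaged: UAR gives every emotion class equal weight, while WAR weights each class by its number of samples, which makes it equal to overall accuracy. The following is a minimal sketch of those standard definitions, not the paper's evaluation code; the function name `uar_war` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) as percentages for paired label sequences.

    Hypothetical helper written for illustration only.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Number of samples per ground-truth class.
    support = Counter(y_true)
    # Number of correctly predicted samples per ground-truth class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Recall of each class that appears in the ground truth.
    recalls = [hits[c] / support[c] for c in support]

    # UAR: plain mean of per-class recalls (each class counts equally).
    uar = 100.0 * sum(recalls) / len(recalls)
    # WAR: recalls weighted by class support, i.e. total correct / total samples.
    war = 100.0 * sum(hits.values()) / len(y_true)
    return uar, war


# Toy, made-up labels: an imbalanced set where the minority class is always missed.
toy_true = ["happy", "happy", "happy", "happy", "sad"]
toy_pred = ["happy", "happy", "happy", "happy", "happy"]
toy_uar, toy_war = uar_war(toy_true, toy_pred)
```

On imbalanced data such as the toy labels here, WAR stays high while UAR drops, which is why the two numbers are usually reported together.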

The paper's primary finding is that integrating audio, visual, and textual inputs and applying instruction tuning improves both emotion recognition and emotion reasoning. On the benchmarks listed above, the authors report that Emotion-LLaMA outperforms other MLLMs.
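The abstract describes the integration step as aligning encoder features into a shared space before they reach the modified LLaMA model. One common way to do this is a per-encoder linear projection to the language model's embedding width, and the sketch below assumes that approach. The summary does not say how the paper implements alignment, so the projection type, all dimensions, the random weights, and every name here (`make_projection`, `project`, `build_multimodal_tokens`, the encoder keys) are hypothetical.

```python
import random


def make_projection(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned linear layer (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, feature):
    """Apply y = W x to a single feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def build_multimodal_tokens(features, projections, text_embeddings):
    """Map each encoder output to the shared width, then prepend to the text tokens."""
    tokens = [project(projections[name], vec) for name, vec in features.items()]
    return tokens + text_embeddings


rng = random.Random(0)

# Hypothetical sizes, chosen small for readability; real encoders are far wider.
SHARED_DIM = 8
ENCODER_DIMS = {
    "hubert_audio": 6,   # audio encoder named in the summary
    "mae_frame": 5,      # visual encoders named in the summary
    "videomae_clip": 7,
    "eva_global": 4,
}

projections = {name: make_projection(dim, SHARED_DIM, rng) for name, dim in ENCODER_DIMS.items()}
# Random vectors standing in for real encoder outputs.
features = {name: [rng.random() for _ in range(dim)] for name, dim in ENCODER_DIMS.items()}
# Placeholder embeddings for three text tokens.
text_embeddings = [[0.0] * SHARED_DIM for _ in range(3)]

sequence = build_multimodal_tokens(features, projections, text_embeddings)
```

After projection, every modality contributes tokens of the same width, so the language model can attend over audio, visual, and text tokens in a single sequence.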
