Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning (2406.11161v2)

Published 17 Jun 2024 in cs.AI and cs.MM

Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal LLMs (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023-SEMI challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.

Summary

  • The paper introduces Emotion-LLaMA, a novel multimodal large language model that achieves state-of-the-art performance in emotion recognition and reasoning.
  • Authors created the MERR dataset with 28,618 coarse and 4,487 fine-grained samples to facilitate learning across diverse emotion scenarios.
  • Emotion-LLaMA integrates audio, visual, and textual inputs using specialized encoders and achieves superior results, such as an F1 score of 0.9036 on the MER2023-SEMI challenge and a zero-shot UAR of 45.59 on DFEW.

The paper "Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning" (2406.11161) introduces a novel multimodal LLM (MLLM) designed for advanced emotion recognition and reasoning. The key contributions and findings of this work are:

  • MERR Dataset: The authors created a new Multimodal Emotion Recognition and Reasoning (MERR) dataset. This dataset contains 28,618 coarse-grained and 4,487 fine-grained annotated samples, encompassing a broad spectrum of emotional categories. The MERR dataset addresses the limitations of existing multimodal emotion instruction datasets and facilitates learning across diverse scenarios.
  • Emotion-LLaMA Model: The paper details the development of Emotion-LLaMA, an MLLM that integrates audio, visual, and textual inputs through specialized emotion encoders. The model uses HuBERT to process audio and employs multiview visual encoders, including MAE, VideoMAE, and EVA, to capture detailed facial information; the encoder outputs are aligned into a shared space with the LLaMA embeddings, and instruction tuning refines the model's emotional recognition and reasoning capabilities (see the fusion sketch after this list).
  • Performance Benchmarking: Emotion-LLaMA was evaluated extensively, demonstrating superior performance compared to other MLLMs across multiple datasets. Key performance metrics include:
    • Clue Overlap score of 7.83 on the EMER dataset
    • Label Overlap score of 6.25 on the EMER dataset
    • F1 score of 0.9036 on the MER2023-SEMI challenge
    • Unweighted Average Recall (UAR) of 45.59 in zero-shot evaluations on the DFEW dataset
    • Weighted Average Recall (WAR) of 59.37 in zero-shot evaluations on the DFEW dataset
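
Conceptually, the fusion step described above can be pictured as per-modality projection layers that map encoder features into the LLaMA embedding space and prepend them to the text prompt tokens. The following sketch is illustrative only: the module names, feature dimensions, and simple concatenation scheme are assumptions for clarity, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusionSketch(nn.Module):
    """Illustrative sketch: project per-modality features into the LLM's
    embedding space and prepend them to the text token embeddings.
    Dimensions and module names are assumptions, not the paper's code."""

    def __init__(self, audio_dim=1024, visual_dim=768, llm_dim=4096):
        super().__init__()
        # One linear projector per modality (hypothetical shapes)
        self.audio_proj = nn.Linear(audio_dim, llm_dim)    # e.g. HuBERT features
        self.visual_proj = nn.Linear(visual_dim, llm_dim)  # e.g. MAE/VideoMAE/EVA features

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats: (B, Ta, audio_dim), visual_feats: (B, Tv, visual_dim)
        # text_embeds: (B, Tt, llm_dim) from the LLaMA embedding layer
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        # Concatenate modality tokens in front of the text prompt tokens
        return torch.cat([a, v, text_embeds], dim=1)

# Usage with random tensors standing in for real encoder outputs
fusion = MultimodalFusionSketch()
audio = torch.randn(1, 8, 1024)
video = torch.randn(1, 16, 768)
text = torch.randn(1, 32, 4096)
fused = fusion(audio, video, text)  # (1, 56, 4096), fed into the LLaMA decoder
```

In the actual model, the fused sequence would be passed through the instruction-tuned LLaMA decoder to produce emotion labels or reasoning text.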

The paper's primary finding is that Emotion-LLaMA significantly improves emotional recognition and reasoning through effective multimodal input integration and instruction tuning. This establishes a new state-of-the-art for multimodal emotion analysis.
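For reference, the UAR and WAR figures reported above differ only in how classes are weighted: UAR averages recall equally across emotion classes, while WAR is plain accuracy and therefore weights classes by their frequency. A minimal illustrative computation (the function name and toy labels are assumptions for the example):

```python
import numpy as np

def uar_war(y_true, y_pred, num_classes):
    """UAR: mean of per-class recalls (each class weighted equally).
    WAR: overall accuracy (classes weighted by their frequency)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls.append((y_pred[mask] == c).mean())
    uar = float(np.mean(recalls))
    war = float((y_pred == y_true).mean())
    return uar, war

# Toy example with 3 emotion classes
uar, war = uar_war([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], num_classes=3)
print(uar, war)  # ~0.722 and ~0.667
```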
