MELD-ST: An Emotion-aware Speech Translation Dataset

(arXiv:2405.13233)
Published May 21, 2024 in cs.CL

Abstract

Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems.

Overview

  • The paper introduces MELD-ST, a dataset for emotion-aware speech translation covering English-to-Japanese and English-to-German, with about 10,000 emotion-annotated utterances per language pair.

  • The dataset is derived from the Multimodal EmotionLines Dataset (MELD), which is built from the TV series Friends, and pairs emotion-labeled English speech with target-language text and speech so that emotional context can be aligned with the spoken source.

  • The study evaluates SeamlessM4T-based baselines and finds that fine-tuning with emotion labels improves translation performance in some settings, particularly for language pairs with distinct lexical and cultural differences, while also indicating areas for further research.

Emotion-Aware Speech Translation: The MELD-ST Dataset

The paper focuses on the often-overlooked role of emotion in speech translation (ST), introducing the MELD-ST dataset to address this gap. The scope encompasses English-to-Japanese (En-Ja) and English-to-German (En-De) translation tasks, with about 10,000 emotion-annotated utterances per language pair drawn from the Multimodal EmotionLines Dataset (MELD). The contribution is salient because the emotional nuance of human conversation is conveyed through vocal tone, facial expressions, and other multimodal cues that text-only translation systems cannot directly exploit.

Introduction and Motivation

The introduction lays out the motivation for the research, emphasizing that accurately conveying emotion across languages is essential for preserving the intended intensity and sentiment of an utterance. This is exemplified by the phrase "Oh my God!", whose translation can vary significantly depending on its emotional context. Prior work in machine translation (MT) has begun to explore emotion-aware translation, but these efforts have largely been confined to text-to-text translation (T2TT). By contrast, emotion has received scant attention in speech-to-text translation (S2TT) and speech-to-speech translation (S2ST), despite marked improvements in ST performance driven by recent datasets and models.

MELD-ST Dataset Creation

The MELD-ST dataset comprises approximately 10,000 utterances per language pair. The data is sourced from the TV series Friends and inherits the emotion labels of the MELD dataset. A table in the paper provides a detailed breakdown of dataset statistics, including the number of utterances and the duration of English and target-language speech. Dataset construction involved three main phases:

Subtitles and Timestamp Extraction: OCR tools were used to convert subtitle images into text, and the resulting text was aligned with the corresponding speech using the subtitle timestamps.
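To make this phase concrete, here is a minimal sketch of OCR over image-based subtitles, assuming each subtitle frame comes with start/end timestamps (e.g., extracted from a DVD or Blu-ray subtitle stream). The paper does not name its OCR tool; pytesseract, the file layout, and the field names below are illustrative assumptions.

```python
# Hedged sketch: OCR image-based subtitles and keep their timestamps for
# later alignment with speech. Tool choice and file layout are illustrative.
from PIL import Image
import pytesseract

def ocr_subtitle(image_path: str, lang: str = "jpn") -> str:
    """Run OCR on one subtitle frame and return the recognized text."""
    text = pytesseract.image_to_string(Image.open(image_path), lang=lang)
    return " ".join(text.split())  # collapse line breaks and extra whitespace

# Example: pair each OCR result with its subtitle timing (hypothetical files).
subtitle_index = [
    {"start": 12.40, "end": 14.10, "image": "ep01_sub_0001.png"},
]
for entry in subtitle_index:
    entry["text"] = ocr_subtitle(entry["image"])
```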

Text Cleaning and Alignment: Heuristics were employed to mitigate OCR errors and speaker name duplications, followed by a careful alignment process that combined audio extraction and CTC segmentation for precise timestamp corrections.
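The paper refines timestamps with CTC segmentation; the exact toolkit is not specified in this summary, so the sketch below shows one possible realization using torchaudio's CTC forced alignment with a wav2vec 2.0 acoustic model. The function name, mono downmixing, and character handling are assumptions for illustration only.

```python
# Hedged sketch: locate an utterance's start/end inside an audio clip via
# CTC forced alignment (one way to implement "CTC segmentation").
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()                      # CTC vocabulary; index 0 is the blank "-"
dictionary = {c: i for i, c in enumerate(labels)}

def align_clip(wav_path: str, transcript: str):
    """Return approximate (start_sec, end_sec) of `transcript` within the clip."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = waveform.mean(dim=0, keepdim=True)                      # downmix to mono
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        emissions, _ = model(waveform)
        log_probs = torch.log_softmax(emissions, dim=-1)               # (1, frames, vocab)
    # wav2vec2's English CTC vocabulary uses upper-case characters, "|" for spaces.
    tokens = [dictionary[c] for c in transcript.upper().replace(" ", "|")
              if c in dictionary and c != "-"]
    targets = torch.tensor([tokens], dtype=torch.int32)
    frame_labels, _ = torchaudio.functional.forced_align(log_probs, targets, blank=0)
    sec_per_frame = waveform.size(1) / bundle.sample_rate / log_probs.size(1)
    nonblank = (frame_labels[0] != 0).nonzero()
    return nonblank[0].item() * sec_per_frame, (nonblank[-1].item() + 1) * sec_per_frame
```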

Data Splitting and Emotion Label Distribution: The dataset was split into training, development, and test sets, with attention to the emotion label distribution so that all splits support robust experimental analysis. A second table in the paper reports the emotion distribution within the dataset.
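A minimal sketch of an emotion-stratified split is shown below, assuming each example carries an "emotion" field; the field name and split ratios are illustrative, and the paper's actual procedure (e.g., splitting by episode or dialogue to avoid leakage) may differ.

```python
# Hedged sketch: split utterances so emotion labels stay balanced across splits.
from sklearn.model_selection import train_test_split

def stratified_split(examples, test_size=0.1, dev_size=0.1, seed=42):
    labels = [ex["emotion"] for ex in examples]
    train_dev, test = train_test_split(
        examples, test_size=test_size, stratify=labels, random_state=seed)
    dev_frac = dev_size / (1.0 - test_size)       # dev fraction relative to the remainder
    train, dev = train_test_split(
        train_dev, test_size=dev_frac,
        stratify=[ex["emotion"] for ex in train_dev], random_state=seed)
    return train, dev, test
```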

Experimental Settings

Baseline models for both S2TT and S2ST tasks were built on the SeamlessM4T v2 model and compared under three training conditions:

  • No fine-tuning
  • Fine-tuning without emotion labels
  • Fine-tuning with emotion labels

Three data configurations were used for fine-tuning: separate En-Ja and En-De datasets, and a mixed dataset combining both. The evaluation metrics included BLEURT for S2TT and ASR-BLEU for S2ST, with prosody evaluation metrics outlined in supplementary materials.
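The summary does not spell out how the emotion labels are fed to the model during fine-tuning. One common approach is to prepend an emotion tag to the source transcript or target text; the sketch below shows that pattern with MELD's seven emotion labels, purely as an illustrative assumption rather than the paper's confirmed method.

```python
# Hedged sketch: expose the emotion label to a translation model by prepending
# a tag token to the text. Tag format is an assumption, not the paper's spec.
EMOTIONS = {"anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"}

def with_emotion_tag(text: str, emotion: str) -> str:
    assert emotion in EMOTIONS, f"unknown emotion label: {emotion}"
    return f"<{emotion}> {text}"

print(with_emotion_tag("Oh my God!", "surprise"))   # -> "<surprise> Oh my God!"
```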

Results and Discussion

S2TT Results: Fine-tuning with emotion labels notably improved translation quality in certain configurations, as reported in the paper's S2TT results table. For the En-Ja pair, incorporating emotion labels yielded a statistically significant improvement in BLEURT, supporting the relevance of emotion annotation to translation quality.
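The paper's exact significance test is not stated in this summary; a paired bootstrap resampling test over per-utterance BLEURT scores, as sketched below, is one standard way such a claim is checked.

```python
# Hedged sketch: paired bootstrap test of whether system B's mean BLEURT
# exceeds system A's; resampling size and seed are arbitrary choices.
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # resample utterances with replacement
        if scores_b[idx].mean() > scores_a[idx].mean():
            wins += 1
    return 1.0 - wins / n_resamples               # approximate p-value for "B > A"
```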

S2ST Results: Fine-tuning the SeamlessM4T model improved ASR-BLEU, albeit modestly. Prosody and vocal-similarity metrics remained largely unchanged, exposing the limitations of SeamlessM4T in capturing nuanced prosodic features, a gap that models dedicated to prosodic fidelity, such as SeamlessExpressive, may help close.
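For context, ASR-BLEU transcribes the generated target speech with an ASR system and scores the transcripts against reference translations. The Seamless toolchain ships its own ASR-BLEU pipeline; the sketch below uses Whisper and sacrebleu only for illustration, with hypothetical file paths.

```python
# Hedged sketch of ASR-BLEU: transcribe synthesized speech, then compute BLEU.
import whisper
import sacrebleu

asr = whisper.load_model("small")

def asr_bleu(audio_paths, references, language="de"):
    """Transcribe each generated clip and compute corpus BLEU against references."""
    hypotheses = [asr.transcribe(path, language=language)["text"].strip()
                  for path in audio_paths]
    # For Japanese, pass tokenize="ja-mecab" to corpus_bleu so scoring is token-based.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

score = asr_bleu(["out/utt_0001.wav"], ["Oh mein Gott!"])   # hypothetical example
print(f"ASR-BLEU: {score:.1f}")
```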

Discussion: The En-De pair consistently outperformed En-Ja across both tasks, attributable to the linguistic proximity between English and German. Manual inspection suggested that emotion labels did not substantially alter the translation outputs in either language pair, highlighting more refined emotion-sensitive translation mechanisms as an area for future investigation.

Conclusion and Limitations

This study introduces the MELD-ST dataset, a pioneering corpus designed to advance emotion-aware speech translation. Initial experiments showed that emotion labels can enhance translation quality in some settings, particularly for language pairs with significant lexical and cultural divergences. Future research could pivot toward multitask models that jointly train for speech emotion recognition and ST, and toward exploiting dialogue context for a more holistic approach to translation.

Limitations: Alignment discrepancies in the dataset and the reliance on acted speech underscore the necessity for further research in spontaneous dialogue contexts. Moreover, the basic ST models used could be augmented with more sophisticated architectures designed explicitly for emotion-aware applications.

Ethics Statement: The dataset will be made available under restricted access to prevent misuse and ensure it serves the intended purpose of advancing research in emotion-aware speech translation.

In closing, MELD-ST marks a significant step into emotion-aware ST, laying the groundwork for future work on integrating emotional context into automated translation systems.
