Abstract

Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best-performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates hallucination detection approaches using LLMs and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to or even better than previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.

Figure: MCC scores reveal that Llama3-70B excels in HRLs, while Claude and GPT models excel in LRLs.

Overview

  • The paper tackles the challenge of detecting hallucinations in machine translation systems, focusing on both high-resource languages (HRLs) and low-resource languages (LRLs).

  • LLMs and various embedding-based methods are evaluated for their efficacy in hallucination detection on the HalOmi benchmark dataset.

  • Key findings include the superior performance of LLMs in hallucination detection, with specific models excelling in different language contexts, and the competitive nature of embedding-based methods, particularly in high-resource settings.

Machine Translation Hallucination Detection for Low and High Resource Languages using LLMs

This paper investigates the challenge of detecting hallucinations in machine translation (MT) systems, with an emphasis on both high-resource languages (HRLs) and low-resource languages (LRLs). LLMs are evaluated for their efficacy in identifying hallucinations across these languages. The study spans 16 translation directions, operating under a massively multilingual framework to examine the performance differentials of various models and embedding-based methods.

Background and Problem Statement

Recent advancements in multilingual MT systems have enhanced translation accuracy significantly. Despite these improvements, hallucinations—instances where the model generates information not present in the source text—remain a critical issue, markedly impairing user trust. The detection of hallucinations has predominantly been successful in HRLs, leaving a substantial performance gap when applied to LRLs. The study assesses a range of LLMs and embedding spaces for hallucination detection, utilizing the HalOmi benchmark dataset, which encompasses both HRLs and LRLs to provide a comprehensive evaluation scope.

Methodology

The paper utilizes the HalOmi benchmark dataset, conducting a large-scale assessment involving:

  1. LLMs: Eight models with different prompt variations were tested, including GPT4-turbo, GPT4o, Command R, Command R+, Mistral-8x22b, Claude Sonnet, Claude Opus, and Llama3-70B (a hypothetical prompt sketch follows this list).
  2. Embedding Spaces: Four spaces were analyzed—OpenAI's text-embedding-3-large, Cohere's Embed v3, Mistral's mistral-embed, and SONAR (the base for the current SOTA, BLASER-QE).
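As a concrete illustration of the prompt-based setup, the sketch below shows how a binary hallucination judgment might be requested from an LLM. The prompt wording and the `query_llm` helper are hypothetical stand-ins, not the paper's actual prompts, which were selected per model on validation data.

```python
# Minimal sketch of binary hallucination detection via LLM prompting.
# PROMPT_TEMPLATE and query_llm are illustrative assumptions; the paper
# selects an optimal prompt per model using EN<->DE validation results.

PROMPT_TEMPLATE = (
    "You are evaluating a machine translation.\n"
    "Source ({src_lang}): {src}\n"
    "Translation ({tgt_lang}): {tgt}\n"
    "Does the translation contain hallucinated content, i.e. information "
    "not supported by the source? Answer with exactly one word: YES or NO."
)

def detect_hallucination(src: str, tgt: str, src_lang: str, tgt_lang: str,
                         query_llm) -> bool:
    """Return True if the LLM flags the translation as hallucinated."""
    prompt = PROMPT_TEMPLATE.format(src=src, tgt=tgt,
                                    src_lang=src_lang, tgt_lang=tgt_lang)
    answer = query_llm(prompt).strip().upper()
    return answer.startswith("YES")
```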

The evaluation framework considers binary hallucination detection and severity ranking. In the binary detection setting, performance is measured by the Matthews Correlation Coefficient (MCC). The optimal prompt for each LLM was selected based on validation results on the EN↔DE directions. For embedding spaces, hallucinations were flagged using the cosine similarity between source and translation embeddings, with decision thresholds optimized on the validation set.
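For the binary setting, MCC is computed from the confusion counts as MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Below is a minimal sketch of the embedding-based detector, assuming source and translation embeddings have already been obtained from one of the four embedding APIs; a low cosine similarity is taken as a hallucination signal, and the threshold is tuned to maximize MCC on the validation set. Function names here are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between a source and a translation embedding."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tune_threshold(val_sims: np.ndarray, val_labels: np.ndarray) -> float:
    """Choose the similarity threshold that maximizes MCC on validation data.

    val_labels: 1 = hallucination, 0 = faithful. A translation is flagged
    when its source-translation similarity falls below the threshold.
    """
    candidates = np.linspace(val_sims.min(), val_sims.max(), 200)
    scores = [matthews_corrcoef(val_labels, (val_sims < t).astype(int))
              for t in candidates]
    return float(candidates[int(np.argmax(scores))])

def predict(sims: np.ndarray, threshold: float) -> np.ndarray:
    """Flag translations whose similarity falls below the tuned threshold."""
    return (sims < threshold).astype(int)
```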

Key Findings

Performance of LLMs: The study demonstrates that LLMs exhibit superior performance in hallucination detection across both HRLs and LRLs.

  • For HRLs, Llama3-70B significantly outperforms BLASER-QE, with an MCC improvement of as much as 0.16.
  • For LRLs, Claude Sonnet marginally surpasses other LLMs by an average of 0.03 MCC, although the overall improvement over existing models is smaller.

Embedding-based Methods:

  • Embedding-based methods remain competitive in high-resource settings, particularly excelling in translation directions involving non-Latin scripts such as AR, RU, and ZH, which suggests strong cross-script transfer capabilities.
  • SONAR embeddings perform comparably to or better than BLASER-QE in most HRL directions, indicating that model performance can depend heavily on the quality of the training data.

LRL Performance Discrepancies: No single LLM uniformly excels across all LRL directions.

  • Llama3-70B performs best overall, but other models outperform it in specific LRL contexts.
  • For non-English-centric directions such as ES↔YO, Claude Opus leads, indicating that LLMs retain strong analytical capabilities even when relevant training data is limited.

Implications

The findings underscore the importance of selecting appropriate models based on specific context requirements, especially considering resource levels and translation directions. The significant performance uplift presented by LLMs, despite their lack of explicit training for MT tasks, points to a broader applicability of these models in diverse linguistic contexts. Moreover, the competitive performance of embedding-based methods, particularly in HRLs, suggests their continued relevance in MT quality assessment frameworks.

Future Directions

The study highlights several avenues for future research:

  • Improved LRL Performance: There remains a need for models that offer robust performance across LRLs, suggesting potential in specialized training or fine-tuning for these languages.
  • Cross-script and Non-English-centric Translation Evaluation: Developing methods that can handle the nuances of non-Latin scripts and non-English-centric translations effectively.
  • Dataset Expansion: Expanding the HalOmi dataset to include more diverse and balanced language pairs, addressing the class imbalances observed in the study.

Conclusion

This work demonstrates the effectiveness of LLMs and embedding-based semantic similarity for hallucination detection, establishing new state-of-the-art results for most evaluated language pairs. The research contributes significantly to the understanding of MT hallucination detection across a wide spectrum of languages and scripts, advocating for future developments that prioritize LRLs and more complex multilingual translation scenarios. This study is instrumental for the MT research community as it navigates the intricate dynamics of hallucination detection, paving the way for more reliable and trustworthy translation systems.
