Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models (2407.16470v3)
Abstract: Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best-performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using LLMs and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering both HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average over HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). For LRLs, however, we observe that Claude Sonnet outperforms the other LLMs by 0.03 MCC on average. The key takeaway from our study is that LLMs can achieve performance comparable to, or even better than, previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
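The embedding-similarity approach and the MCC metric mentioned in the abstract can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: the vectors are toy stand-ins for multilingual sentence embeddings (a real system would use an encoder such as LaBSE or SONAR), and the 0.5 similarity threshold is an assumed value that would normally be tuned on held-out data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient from binary labels/predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy (source, translation) embedding pairs with gold labels
# (1 = hallucination, 0 = faithful). Low cross-lingual similarity
# between source and translation flags a hallucination.
pairs = [
    (([0.9, 0.1, 0.0], [0.88, 0.12, 0.01]), 0),
    (([0.9, 0.1, 0.0], [0.05, 0.20, 0.95]), 1),
    (([0.2, 0.7, 0.1], [0.22, 0.66, 0.10]), 0),
    (([0.1, 0.1, 0.9], [0.80, 0.10, 0.10]), 1),
]
THRESHOLD = 0.5  # assumed value; tuned per language pair in practice
labels = [y for _, y in pairs]
preds = [int(cosine(src, hyp) < THRESHOLD) for (src, hyp), _ in pairs]
print(preds, mcc(labels, preds))  # → [0, 1, 0, 1] 1.0
```

MCC is a natural choice here because hallucinations are rare, making the classes heavily imbalanced; unlike accuracy, MCC stays near 0 for a classifier that simply predicts the majority class.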