Emergent Mind

Abstract

Traditionally, Machine Translation (MT) Evaluation has been treated as a regression problem -- producing an absolute translation-quality score. This approach has two limitations: i) the scores lack interpretability, and human annotators struggle to give consistent scores; ii) most scoring methods are based on (reference, translation) pairs, limiting their applicability in real-world scenarios where references are absent. In practice, we often care about whether a new MT system is better or worse than some competitors. In addition, reference-free MT evaluation is increasingly practical and necessary. Unfortunately, these two practical considerations have yet to be jointly explored. In this work, we formulate reference-free MT evaluation as a pairwise ranking problem. Given the source sentence and a pair of translations, our system predicts which translation is better. In addition to proposing this new formulation, we further show that this new paradigm can demonstrate superior correlation with human judgments by merely using indirect supervision from natural language inference and weak supervision from our synthetic data. In the context of reference-free evaluation, MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21. On a more challenging benchmark, ACES, which contains fine-grained evaluation criteria such as addition, omission, and mistranslation errors, MT-Ranker achieves state-of-the-art results against both reference-free and reference-based baselines.

Key Points

  • MT-Ranker is proposed as an innovative system for reference-free MT evaluation, focusing on inter-system ranking based on the source sentence.

  • The methodology combines indirect supervision from multilingual natural language inference (NLI) with weak supervision from synthetic data, using multilingual T5's encoder.

  • Empirical results show that MT-Ranker achieves state-of-the-art performance and strong correlations with human judgments on various benchmarks.

  • The system's independence from human-annotated data could shift discussions on translation quality evaluation metrics.

  • MT-Ranker promotes a shift towards comparative quality assessment in MT, which aligns with practical needs better than the pursuit of absolute quality scores.

Overview

Machine Translation (MT) Evaluation has traditionally been framed as a regression task: assigning an absolute quality score to each translation. This conventional approach has two critical limitations: the scores lack interpretability and are difficult for human annotators to assign consistently, and most scoring methods require reference translations, which are often unavailable in real-world settings. The authors propose a departure from absolute scoring, framing the problem instead as a pairwise ranking task. Their system, MT-Ranker, is trained to predict which of two given translations is better based on the source sentence alone. This formulation enables practical, reference-free MT evaluation without depending on high-quality manual annotations, which are difficult to gather consistently.

Methodology

MT-Ranker departs from per-translation scoring by formulating the input as a single text that contains the source sentence and two candidate translations. Rather than relying on human-annotated quality scores, the authors combine indirect supervision from multilingual natural language inference (NLI) with weak supervision from synthetic data. The model's architecture is built on the encoder of multilingual T5, with mean pooling and a logistic-regression head, and it is fine-tuned through a three-stage process (a minimal code sketch of this setup follows the list):

  1. Initial pretraining on the XNLI dataset serves as an indirect supervision step.
  2. Subsequent fine-tuning to distinguish human translations from machine translations.
  3. Final fine-tuning on a large corpus of synthetic data, in which perturbing a translation yields better-worse translation pairs.
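
To make the architecture description concrete, below is a minimal sketch of a pairwise ranker along the lines described above: a multilingual T5 encoder, mean pooling over token representations, and a logistic-regression-style head applied to a single text containing the source and both candidates. The input template, the checkpoint name google/mt5-base, and the class and head design are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of a pairwise translation ranker in the spirit of MT-Ranker.
    # Assumptions: the input template, the classifier head, and the checkpoint
    # name "google/mt5-base" are illustrative choices, not the paper's code.
    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, MT5EncoderModel

    class PairwiseRanker(nn.Module):
        def __init__(self, model_name="google/mt5-base"):
            super().__init__()
            self.encoder = MT5EncoderModel.from_pretrained(model_name)
            hidden = self.encoder.config.d_model
            # Logistic-regression-style head: one logit for "translation 1 is better".
            self.classifier = nn.Linear(hidden, 1)

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            hidden_states = out.last_hidden_state                      # (batch, seq, dim)
            # Mean pooling over non-padding tokens.
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
            return torch.sigmoid(self.classifier(pooled)).squeeze(-1)  # P(candidate 1 better)

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
    ranker = PairwiseRanker()

    # Source sentence plus two candidate translations packed into a single text.
    text = ("source: Der Hund schläft im Garten. "
            "translation 1: The dog sleeps in the garden. "
            "translation 2: The dog is garden sleeping.")
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    prob_first_better = ranker(batch["input_ids"], batch["attention_mask"])
    print(float(prob_first_better))  # untrained weights here, so this only checks shapes

In the trained system, this probability is what drives the better-worse decision between the two candidates; no reference translation enters the input at any point.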

Empirical Results

Empirical evaluations show that MT-Ranker, trained without task-specific human annotations, achieves state-of-the-art performance against both reference-free and reference-based MT evaluation models. On the ACES benchmark, which spans over a hundred language pairs and fine-grained criteria ranging from addition and omission errors to mistranslation, MT-Ranker outperforms existing baselines. Strong correlations with human judgments are also reported across diverse benchmarks, including DARR20 and MQM20-22, highlighting MT-Ranker's robustness and its independence from human-provided reference translations.
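
For context on how such a ranker is typically scored against human judgments: relative-ranking benchmarks such as DARR20 provide triples of a source sentence with a human-preferred and a human-dispreferred translation, and a metric is credited when its preference agrees with the annotator's. The snippet below computes this pairwise agreement together with the Kendall tau-like statistic commonly reported for WMT relative-ranking data; the triple format and function names are illustrative assumptions, not the paper's evaluation code.

    # Pairwise agreement against human better/worse judgments, plus the
    # Kendall tau-like score used for WMT relative-ranking data.
    # The (source, better, worse) triple format here is an assumption.
    def kendall_tau_like(judgments, prefers_first):
        """judgments: list of (source, better, worse) triples from human annotators.
        prefers_first(source, a, b) -> True if the metric ranks a above b."""
        concordant = discordant = 0
        for source, better, worse in judgments:
            if prefers_first(source, better, worse):
                concordant += 1
            else:
                discordant += 1
        total = concordant + discordant
        accuracy = concordant / total
        tau = (concordant - discordant) / total
        return accuracy, tau

    # Toy usage with a trivial "metric" that prefers the longer candidate.
    toy = [("Der Hund schläft.", "The dog is sleeping.", "Dog sleep."),
           ("Es regnet.", "It is raining.", "It rain.")]
    print(kendall_tau_like(toy, lambda s, a, b: len(a) > len(b)))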

Implications and Further Analysis

What makes MT-Ranker particularly notable is its independence from human-annotated data, which is often considered essential for building evaluation metrics. This has implications for practical utility and also opens discussion on how translation quality can be reliably judged. The authors' analysis further examines the contribution of each of the three training stages, the potential for further improvement with human annotations, generalizability, and performance in detecting various error types.

MT-Ranker restructures the evaluation process around the comparative quality of translations rather than absolute scores. This framing is better suited to practical applications of MT, where the question is usually whether one system outperforms another, not what its exact quality score is. By avoiding supervision from human quality judgments and relying instead on indirect and weak supervision, MT-Ranker represents a notable shift in the MT evaluation landscape, one likely to steer future research toward ranking-based, reference-free evaluation.
