Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing

Published 30 Apr 2020 in cs.CL | (2004.14564v2)

Abstract: We frame the task of machine translation evaluation as one of scoring machine translation output with a sequence-to-sequence paraphraser, conditioned on a human reference. We propose training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot translation task (e.g., Czech to Czech). This results in the paraphraser's output mode being centered around a copy of the input sequence, which represents the best case scenario where the MT system output matches a human reference. Our method is simple and intuitive, and does not require human judgements for training. Our single model (trained in 39 languages) outperforms or statistically ties with all prior metrics on the WMT 2019 segment-level shared metrics task in all languages (excluding Gujarati where the model had no training data). We also explore using our model for the task of quality estimation as a metric--conditioning on the source instead of the reference--and find that it significantly outperforms every submission to the WMT 2019 shared task on quality estimation in every language pair.

Citations (179)

Summary

The paper proposes a novel MT evaluation framework that reframes paraphrasing as a zero-shot translation task within a multilingual NMT model.
The resulting metric outperforms or statistically ties with prior metrics, including BLEU and BERTscore, on the WMT 2019 segment-level metrics task, assigning its highest scores to MT output that matches the human reference.
The approach requires no human judgments for training and, when conditioned on the source instead of the reference, also serves as a language-agnostic quality estimation metric.

Analysis of Machine Translation Evaluation through Zero-Shot Paraphrasing

This paper presents an approach to evaluating machine translation (MT) systems with a sequence-to-sequence paraphraser trained as a multilingual neural machine translation (NMT) system. By reframing paraphrasing as a zero-shot translation task, the authors obtain a metric for translation quality that needs no human judgments for training. Their single model, trained in 39 languages, outperforms or statistically ties with all prior MT evaluation metrics on the WMT 2019 segment-level shared metrics task in every language except Gujarati, for which the model had no training data.

Key Findings and Methodology

The primary innovation lies in treating sentential paraphrasing as zero-shot translation (e.g., Czech to Czech) within a multilingual NMT architecture. The paraphraser's output mode ends up centered on a copy of its input, so MT output that matches the human reference lexically and syntactically receives the best score; the authors term this a "lexically/syntactically unbiased paraphraser". The approach needs no human judgment data for training and relies instead on parallel bitext across many languages, which lets a single model serve as the metric for all of them.
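The scoring idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the score is a length-normalized token log-probability obtained by force-decoding, and that the reference-based variant averages the two directions (hypothesis given reference, and reference given hypothesis). The `TokenLogProbs` interface and `toy_scorer` are hypothetical stand-ins for a real multilingual paraphraser.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (an assumption for this sketch): given a conditioning
# sentence and a target sentence, both tokenized, return one log-probability
# per target token, as a force-decoding paraphraser would.
TokenLogProbs = Callable[[Sequence[str], Sequence[str]], Sequence[float]]


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability of a force-decoded target."""
    if not token_logprobs:
        raise ValueError("target must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def reference_conditioned_score(
    scorer: TokenLogProbs, hyp: Sequence[str], ref: Sequence[str]
) -> float:
    """Assumed reference-based score: mean of both scoring directions."""
    hyp_given_ref = length_normalized_logprob(scorer(ref, hyp))
    ref_given_hyp = length_normalized_logprob(scorer(hyp, ref))
    return 0.5 * (hyp_given_ref + ref_given_hyp)


def source_conditioned_score(
    scorer: TokenLogProbs, hyp: Sequence[str], src: Sequence[str]
) -> float:
    """Assumed quality-estimation score: condition on the source only."""
    return length_normalized_logprob(scorer(src, hyp))


def toy_scorer(condition: Sequence[str], target: Sequence[str]) -> list[float]:
    """Toy stand-in, NOT a real model: target tokens that also appear in the
    conditioning sentence get probability 0.9, all others 0.1."""
    seen = set(condition)
    return [math.log(0.9) if tok in seen else math.log(0.1) for tok in target]


ref = "the cat sat".split()
exact_score = reference_conditioned_score(toy_scorer, ref, ref)
worse_score = reference_conditioned_score(toy_scorer, "a dog sat".split(), ref)
print(exact_score > worse_score)
```

With the toy scorer, a hypothesis identical to the reference scores higher than one that shares only a single token, mirroring the copy-centered behaviour described above. A real system would replace `toy_scorer` with token log-probabilities from the trained model.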

On the WMT 2019 segment-level metrics task, the multilingual model outperforms or statistically ties with prior metrics, including BLEU, whose correlation with human judgment has diminished as MT systems improve. Prism-ref, the reference-conditioned metric, also compares favorably with recent learned and embedding-based metrics such as BERTscore and BLEURT. Prism-src, the source-conditioned variant used for quality estimation (QE) as a metric, significantly outperforms every submission to the WMT 2019 QE shared task in every language pair, without using any reference.
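To make "correlation with human judgment" concrete at the segment level, the sketch below computes a Kendall's Tau-like statistic over pairs of translations in which human annotators preferred one over the other. The summary above does not say which statistic the shared task reports, so this formulation, and the choice to count metric ties as discordant, are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def kendall_tau_like(pairs: Sequence[Tuple[float, float]]) -> float:
    """Agreement between a metric and human pairwise preferences.

    Each pair holds (metric score of the human-preferred translation,
    metric score of the other translation). A pair is concordant when the
    metric also scores the preferred translation strictly higher; ties are
    counted as discordant (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one judged pair")
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / (concordant + discordant)


# Made-up metric scores for four judged pairs: two agreements, one
# disagreement, and one tie.
example_pairs = [(0.9, 0.1), (0.8, 0.2), (0.3, 0.7), (0.5, 0.5)]
print(kendall_tau_like(example_pairs))
```

The statistic ranges from -1 (the metric always disagrees with the human preference) to 1 (it always agrees).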

Implications and Future Directions

The main practical implication is scalability: one model, trained only on parallel bitext, covers many languages and needs no human judgments for training. Because the source-conditioned variant also works without a reference, the approach could in principle support translation quality assessment where no reference exists, allowing faster iteration with less human oversight.

Moving forward, the scope for refinement and expansion is substantial. The potential to extend this methodology to document-level evaluation aligns with current trends advocating for broader contextual considerations in translation. Moreover, as stronger multilingual models emerge, further gains in evaluation accuracy and efficiency are expected.

Conclusion

This paper represents a significant shift towards leveraging multilingual NMT models for automatic evaluation, offering a more resilient and versatile framework compared to legacy metrics. As multilingual training methods continue to evolve, their utility in creating robust, language-agnostic evaluation tools will likely catalyze advancements in both MT systems and broader NLP applications. The release of the model and toolkit paves the way for further exploration and collaboration within the MT research community.