BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation (2103.11878v4)

Published 22 Mar 2021 in cs.CL and cs.AI

Abstract: Standard automatic metrics, e.g. BLEU, are not reliable for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the discourse phenomena that cause context-agnostic translations. This paper introduces a novel automatic metric BlonDe to widen the scope of automatic MT evaluation from sentence to document level. BlonDe takes discourse coherence into consideration by categorizing discourse-related spans and calculating the similarity-based F1 measure of categorized spans. We conduct extensive comparisons on a newly constructed dataset BWB. The experimental results show that BlonDe possesses better selectivity and interpretability at the document-level, and is more sensitive to document-level nuances. In a large-scale human study, BlonDe also achieves significantly higher Pearson's r correlation with human judgments compared to previous metrics.

Summary

The paper introduces BlonDe, a novel metric that evaluates document-level MT by assessing discourse phenomena like inconsistency, ellipsis, and ambiguity.
It employs a similarity-based F1 measure and categorizes discourse spans to extend evaluation from isolated sentences to whole documents.
BlonDe achieves a higher Pearson's correlation with human judgments than traditional metrics, which supports its use in developing more context-aware MT systems.

BlonDe: A Document-Level MT Evaluation Metric

The paper presents BlonDe, a metric designed to address the shortcomings of standard metrics such as BLEU when evaluating document-level machine translation (MT). Standard metrics score sentences in isolation, so they cannot account for inter-sentential context or discourse phenomena. BlonDe instead evaluates translation quality from a document-level perspective.

BlonDe categorizes discourse-related spans and computes a similarity-based F1 measure over each category. This extends the evaluation from isolated sentences to whole documents and brings discourse coherence into the assessment. The document-level phenomena BlonDe targets include inconsistency, ellipsis, and ambiguity, which traditional metrics typically miss but which matter for assessing MT quality at the document level.
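The idea of a per-category F1 over discourse-related spans can be illustrated with a small sketch. This is not the official BlonDe implementation: the two category lexicons, the whitespace tokenizer, the exact-match counting, and the function names `categorize` and `category_f1` are all assumptions made for illustration. It only shows how pooling span counts over a whole document and clipping matches against the reference yields a precision, recall, and F1 per category.

```python
from collections import Counter

# Hypothetical, tiny category lexicons for illustration only; a real system
# would use proper tagging (e.g. POS or NER) rather than fixed word lists.
CATEGORIES = {
    "pronoun": {"he", "she", "it", "they", "him", "her", "them"},
    "discourse_marker": {"however", "but", "therefore", "so", "because"},
}


def categorize(document):
    """Count category members over all sentences of a document (list of str)."""
    counts = {name: Counter() for name in CATEGORIES}
    for sentence in document:
        for token in sentence.lower().split():
            token = token.strip(".,;:!?\"'")
            for name, lexicon in CATEGORIES.items():
                if token in lexicon:
                    counts[name][token] += 1
    return counts


def category_f1(hyp_doc, ref_doc):
    """Return an F1 score per category using clipped document-level counts."""
    hyp, ref = categorize(hyp_doc), categorize(ref_doc)
    scores = {}
    for name in CATEGORIES:
        # Counter intersection keeps the minimum count per token (clipping).
        matched = sum((hyp[name] & ref[name]).values())
        hyp_total = sum(hyp[name].values())
        ref_total = sum(ref[name].values())
        precision = matched / hyp_total if hyp_total else 0.0
        recall = matched / ref_total if ref_total else 0.0
        denom = precision + recall
        scores[name] = 2 * precision * recall / denom if denom else 0.0
    return scores


reference = ["She left early.", "However, he stayed."]
hypothesis = ["He left early.", "But he stayed."]
print(category_f1(hypothesis, reference))
```

In this toy example the hypothesis renders the first pronoun inconsistently and swaps the discourse marker, so both category scores drop even though most of the remaining words match the reference.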

The researchers demonstrate BlonDe's effectiveness through experiments on a newly constructed document-level dataset, Bilingual Web Book (BWB). The BWB dataset spans multiple genres and contains over 9 million sentence pairs, and its analysis reveals a substantial proportion of document-level translation errors. The authors categorize these errors and report that inconsistency (64.4%), ellipsis (20.3%), and ambiguity (7.3%) account for a large share of translation mistakes.

BlonDe shows better selectivity and interpretability than existing metrics. In a large-scale human study, it achieves a significantly higher Pearson's r correlation with human judgments than prior metrics, which supports its validity for document-level MT evaluation. BlonDe also evaluates pronouns, tenses, named entities, and discourse markers, giving a view of translation quality that goes beyond the sentence level.
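For readers unfamiliar with how a metric is compared against human judgments, the following sketch computes Pearson's r between two lists of scores. The function name `pearson_r` and the example numbers are hypothetical; they are not data from the paper's human study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for four documents: one automatic metric vs. human ratings.
metric_scores = [1, 2, 3, 4]
human_scores = [1, 3, 2, 4]
print(pearson_r(metric_scores, human_scores))
```

A value closer to 1 means the metric ranks and spaces documents more like the human raters do.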

The paper also introduces two variants of BlonDe: dBlonDe, which further isolates document-specific translations, and BlonDe+, which incorporates human annotations. The latter lets users integrate human-evaluated discourse features into BlonDe's framework, giving more flexibility and precision in translation evaluation.

For the MT community, BlonDe provides a framework for evaluating MT systems in a way that aligns more closely with human judgment, particularly for document-level tasks, and so encourages the development of translation systems that better handle contextual dependencies. As MT approaches evolve, metrics like BlonDe will help gauge progress toward translations that are coherent, cohesive, and contextually appropriate at the document level. Future work may extend BlonDe to additional languages and discourse phenomena, broadening its applicability across MT scenarios.

GitHub

GitHub - EleanorJiang/BlonDe: Official implementations for (1) BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation and (2) Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus (77 stars)