BLEURT: Learning Robust Metrics for Text Generation (2004.04696v5)

Published 9 Apr 2020 in cs.CL

Abstract: Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

Citations (1,337)

Summary

  • The paper introduces BLEURT, a BERT-based metric that enhances text generation evaluation through synthetic data pre-training.
  • It fine-tunes BERT on limited human ratings with a regression loss, achieving state-of-the-art performance on the WMT Metrics shared tasks.
  • BLEURT demonstrates robust adaptability across varied domains and quality drifts, consistently aligning its evaluations with human judgment.

BLEURT: Learning Robust Metrics for Text Generation

Introduction

Rapid advances in natural language generation (NLG) make accurate evaluation metrics increasingly vital. Traditional metrics such as BLEU and ROUGE rely primarily on n-gram overlap and often correlate poorly with human judgment, especially as NLG systems improve. This paper introduces BLEURT, a BERT-based learned evaluation metric designed to model human judgments effectively even with limited training data. BLEURT applies a novel pre-training scheme over synthetic data to improve robustness, achieving superior performance on the WMT Metrics shared tasks.
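
The authors released checkpoints and a small Python library alongside the paper. A minimal usage sketch, assuming the google-research/bleurt package is installed and a checkpoint has been downloaded (the checkpoint path below is illustrative):

```python
# Usage sketch of the released BLEURT library
# (https://github.com/google-research/bleurt).
# The checkpoint must be downloaded separately; the path is illustrative.
from bleurt import score

scorer = score.BleurtScorer("bleurt-base-128")
scores = scorer.score(references=["the cat sat on the mat"],
                      candidates=["a cat was sitting on the mat"])
print(scores)  # one float per (reference, candidate) pair; higher is better
```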

Fine-Tuning BERT for Evaluation

BLEURT fine-tunes BERT's contextualized representations on datasets of human ratings. Because the available rating data is small, BERT's unsupervised pre-training is what equips the model to predict human-assigned quality scores reliably. The architecture adds a single linear layer atop BERT's [CLS] token output, mapping the representation of a (reference, candidate) pair to a scalar score, and trains it with a regression loss. This straightforward setup already achieves state-of-the-art performance in several NLG evaluation settings.
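
A minimal sketch of this architecture, using PyTorch and Hugging Face transformers rather than the authors' original TensorFlow implementation; the model name, example sentences, and rating are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BleurtStyleRegressor(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # One linear layer maps the [CLS] vector to a scalar quality score.
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]  # representation of the [CLS] token
        return self.head(cls).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BleurtStyleRegressor()
# Reference and candidate are packed into a single BERT input sequence.
batch = tokenizer(["the cat sat on the mat"],
                  ["a cat was sitting on the mat"],
                  return_tensors="pt", padding=True, truncation=True)
pred = model(**batch)
target = torch.tensor([0.8])                 # human rating (made up)
loss = nn.functional.mse_loss(pred, target)  # the paper's regression loss
```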

Synthetic Data Pre-Training

The cornerstone of BLEURT is an additional pre-training phase on millions of synthetic sentence pairs, designed to prime BERT for evaluation before fine-tuning. The pairs are generated by perturbing sentences with several techniques, including BERT mask-filling, backtranslation, and random word dropout. Each pair is then labeled with a set of pre-training signals, such as BLEU, ROUGE, and BERTscore, which are combined in a multi-task loss. Because these signals capture both lexical and semantic differences, this phase equips BLEURT to handle domain and quality drift effectively.
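
A sketch of two of these perturbations, mask-filling and word dropout (backtranslation omitted for brevity), using a Hugging Face fill-mask pipeline; the sampling scheme here is a simplification of the paper's procedure:

```python
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def mask_fill_perturb(sentence: str) -> str:
    """Replace one random word with BERT's top mask-fill prediction."""
    tokens = sentence.split()
    i = random.randrange(len(tokens))
    tokens[i] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
    return fill_mask(" ".join(tokens))[0]["sequence"]

def word_dropout_perturb(sentence: str, p: float = 0.15) -> str:
    """Randomly drop words, producing a degraded candidate."""
    kept = [t for t in sentence.split() if random.random() > p]
    return " ".join(kept) if kept else sentence

reference = "text generation has made significant advances in recent years"
synthetic_pairs = [(reference, mask_fill_perturb(reference)),
                   (reference, word_dropout_perturb(reference))]
```

Each synthetic pair would then be scored with the pre-training signals (BLEU, ROUGE, BERTscore, and so on) to produce the multi-task targets.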

Experiments and Results

WMT Metrics Shared Tasks

BLEURT was tested on three years of the WMT Metrics Shared Task, matching or surpassing the best participant systems. The results, consistent across multiple language pairs, highlight BLEURT's robustness and accuracy under different fine-tuning regimes. Importantly, pre-training yielded notable improvements, particularly when the training data was not i.i.d. with the test data.
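
Segment-level agreement in these evaluations is measured with rank-correlation statistics such as Kendall's tau (the exact WMT formulation varies by year). A toy illustration with scipy, using made-up ratings:

```python
from scipy.stats import kendalltau

human_ratings = [0.2, 0.5, 0.9, 0.4, 0.7]     # made-up segment-level ratings
metric_scores = [0.25, 0.45, 0.85, 0.50, 0.65]
tau, p_value = kendalltau(human_ratings, metric_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```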

Robustness to Quality Drift

In scenarios mimicking real-world quality drift, where training ratings skew toward low quality while test ratings skew toward high quality, BLEURT maintained strong correlations with human judgments. Pre-training was crucial here: it enabled BLEURT to surpass baseline metrics such as sentBLEU and BERTscore by significant margins across varying skew factors.
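
The drift is induced by sampling training and test data with opposite skews over the human ratings. A simplified stand-in that hard-splits a dataset at a rating quantile (the paper's protocol uses a smoother skew factor; the field names are assumptions):

```python
def skewed_split(examples, rating_key="rating", train_quantile=0.5):
    """Train on the lower-rated examples, test on the higher-rated ones."""
    ranked = sorted(examples, key=lambda ex: ex[rating_key])
    cut = int(len(ranked) * train_quantile)
    return ranked[:cut], ranked[cut:]

data = [{"candidate": f"hyp {i}", "reference": f"ref {i}", "rating": r}
        for i, r in enumerate([0.1, 0.3, 0.4, 0.6, 0.8, 0.9])]
train_set, test_set = skewed_split(data)  # train skews low, test skews high
```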

Adaptation to New Domains

BLEURT's adaptability to new domains was demonstrated on the WebNLG dataset, which rates data-to-text outputs along aspects such as semantics, grammar, and fluency. Using limited in-domain training data, the metric quickly aligns with human judgments on each aspect, as sketched below. This versatility and low-data fitting capacity make BLEURT a robust choice across different tasks.
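
A sketch of how such multi-aspect fine-tuning could look, using a three-output regression head in Hugging Face transformers; this illustrates the idea, not the authors' exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3, problem_type="regression")

batch = tokenizer(["the reference sentence"], ["the candidate sentence"],
                  return_tensors="pt", padding=True, truncation=True)
# One target per aspect: semantics, grammar, fluency (values made up).
labels = torch.tensor([[0.8, 0.9, 0.7]])
out = model(**batch, labels=labels)  # MSE loss under problem_type="regression"
out.loss.backward()                  # an optimizer step would follow
```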

Conclusion

BLEURT's design, combining BERT with strategic pre-training on synthetic data, establishes it as a powerful metric for NLG evaluation. Its ability to adapt and generalize under diverse conditions while staying aligned with human judgment makes it a strong candidate for widespread adoption. Future efforts could explore multilingual capabilities and hybrid methods combining human evaluation with automatic metrics.
