- The paper introduces BLEURT, a BERT-based metric that enhances text generation evaluation through synthetic data pre-training.
- It fine-tunes BERT on limited human ratings with a regression loss, achieving state-of-the-art performance on the WMT Metrics shared task.
- BLEURT demonstrates robust adaptability across varied domains and quality drifts, consistently aligning its evaluations with human judgment.
BLEURT: Learning Robust Metrics for Text Generation
Introduction
Advances in natural language generation (NLG) make accurate evaluation metrics increasingly vital. Traditional metrics such as BLEU and ROUGE rely on n-gram overlap and often correlate poorly with human judgment, especially as NLG systems improve. This paper introduces BLEURT, a BERT-based learned evaluation metric designed to model human judgments even when little training data is available. BLEURT's key novelty is a pre-training phase on synthetic data that makes the metric robust, yielding state-of-the-art results on the WMT Metrics shared task.
Fine-Tuning BERT for Evaluation
BLEURT builds on BERT's contextualized representations, fine-tuned on human rating datasets. Because these datasets are small, starting from BERT's unsupervised pre-training is what makes it feasible to learn a reliable predictor of human-assigned quality ratings. The reference and the candidate sentence are fed to BERT jointly, and a linear layer on top of the [CLS] token output maps the pooled representation to a single regression score. Trained with a squared-error regression loss, this setup already reaches state-of-the-art performance in several NLG evaluation settings.
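A minimal sketch of this regression setup, assuming the Hugging Face `transformers` library and PyTorch (the original implementation uses TensorFlow; class and variable names here are illustrative):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BleurtStyleRegressor(nn.Module):
    """BERT encoder plus a linear head on the [CLS] representation."""
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token vector
        return self.head(cls).squeeze(-1)   # scalar quality score

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BleurtStyleRegressor()

# Reference and candidate are encoded jointly as one sentence pair.
batch = tokenizer(["the cat sat on the mat"], ["a cat sits on a mat"],
                  return_tensors="pt", padding=True, truncation=True)
pred = model(batch["input_ids"], batch["attention_mask"])

# Fine-tuning minimizes squared error against human ratings y.
y = torch.tensor([0.8])
loss = nn.functional.mse_loss(pred, y)
```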
Synthetic Data Pre-Training
The cornerstone of BLEURT is an additional pre-training phase on synthetic data that primes BERT for evaluation tasks. Millions of sentence pairs are generated by perturbing Wikipedia sentences with three techniques: BERT mask-filling, backtranslation, and random word dropout. Each pair is then annotated with automatic pre-training signals, including BLEU, ROUGE, BERTscore, backtranslation likelihood, and textual entailment, which are combined in a multi-task loss. Because these signals capture both lexical and semantic differences, the pre-trained model copes far better with domain and quality drift.
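A hedged sketch of one perturbation (word dropout) paired with one pre-training signal (sentence-level BLEU via `sacrebleu`); the paper additionally uses mask-filling, backtranslation, ROUGE, BERTscore, backtranslation likelihood, and entailment, all omitted here for brevity:

```python
import random
import sacrebleu

def word_dropout(sentence: str, drop_prob: float = 0.15) -> str:
    """Randomly delete tokens to derive a perturbed candidate z~ from z."""
    tokens = sentence.split()
    kept = [t for t in tokens if random.random() > drop_prob]
    return " ".join(kept) if kept else sentence

def make_synthetic_pair(z: str):
    """Return (z, z~) and a dict of pre-training signals for the multi-task loss."""
    z_tilde = word_dropout(z)
    signals = {
        # One lexical-overlap signal; the real recipe regresses on several.
        "bleu": sacrebleu.sentence_bleu(z_tilde, [z]).score / 100.0,
    }
    return z, z_tilde, signals

z, z_tilde, signals = make_synthetic_pair(
    "the quick brown fox jumps over the lazy dog")
print(z_tilde, signals)
```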
Experiments and Results
WMT Metrics Shared Tasks
BLEURT was evaluated on three years of the WMT Metrics Shared Task (2017-2019), where it matched or outperformed the best participant systems. The results are consistent across language pairs, and ablations show that the synthetic pre-training phase adds clear gains on top of fine-tuning alone, particularly when the training and test data are not i.i.d.
Robustness to Quality Drift
To mimic real-world rating drift, the training data was sub-sampled to skew toward low ratings while the test data skewed toward high ratings. Under these conditions BLEURT maintained strong correlations with human judgments, and pre-training proved crucial: the pre-trained model outperformed baselines such as sentBLEU and BERTscore by a wide margin, even as the skew factor increased.
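An illustrative sketch of how such skewed train/test splits could be constructed; the exact weighting in the paper is governed by a skew factor, and the weight function below is an assumption, not the paper's formula:

```python
import random

def skewed_sample(records, alpha=1.0, n=1000, low_quality=True):
    """records: list of (candidate, reference, rating) tuples, rating in [0, 1].

    The sampling weight concentrates on low ratings (train) or high
    ratings (test) as alpha grows, breaking the i.i.d. link between splits.
    """
    def weight(rating):
        return (1.0 - rating) ** alpha if low_quality else rating ** alpha
    weights = [weight(r) for _, _, r in records]
    return random.choices(records, weights=weights, k=n)

# train = skewed_sample(all_records, alpha=2.0, low_quality=True)
# test  = skewed_sample(all_records, alpha=2.0, low_quality=False)
```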
Adaptation to New Domains
BLEURT's adaptability to new domains was demonstrated on the WebNLG dataset, a data-to-text task quite different from the WMT translation data. Using only limited in-domain training data, the metric quickly aligns with human judgments on aspects such as semantics, grammar, and fluency. This low-data adaptability makes BLEURT a robust choice across different generation tasks.
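In practice, applying a trained BLEURT checkpoint reduces to a few lines. A minimal usage sketch, assuming the public google-research/bleurt package; the checkpoint path is a placeholder for a downloaded or fine-tuned checkpoint:

```python
from bleurt import score  # pip install from github.com/google-research/bleurt

scorer = score.BleurtScorer("path/to/bleurt_checkpoint")
scores = scorer.score(
    references=["The cake was delicious."],
    candidates=["The cake tasted great."],
)
print(scores)  # one learned quality score per (reference, candidate) pair
```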
Conclusion
By combining BERT with strategic pre-training on synthetic data, BLEURT establishes itself as a powerful learned metric for NLG evaluation. Its ability to adapt and generalize under domain and quality drift while staying aligned with human judgment makes it a strong candidate for wide adoption. Future work could explore multilingual checkpoints and hybrid methods that combine human evaluation with automatic metrics.