Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

Published 10 Oct 2022 in cs.CL and cs.AI | (2210.05035v2)

Abstract: Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.

Summary

The paper introduces SEScore, a novel evaluation metric leveraging stratified error synthesis to simulate varying error severities.
It employs controlled operations like insertion, deletion, substitution, and swap to generate synthetic error data that aligns with human judgment.
Experimental results show SEScore’s competitive performance on tasks like machine translation and image captioning, improving correlation with human evaluations.

An Expert Review on "Not All Errors Are Equal: Learning Text Generation Metrics using Stratified Error Synthesis"

The paper "Not All Errors Are Equal: Learning Text Generation Metrics using Stratified Error Synthesis" presents an innovative approach for developing a robust and generalizable evaluation metric for natural language generation (NLG) tasks. The technical core of the research is the introduction of SEScore, a model-based metric that circumvents the need for extensive human annotations by leveraging a stratified error synthesis and severity scoring pipeline.

Methodological Overview

The authors critique existing metrics for their reliance on human judgement data and limited applicability across diverse NLG tasks. They posit that traditional n-gram-based evaluation methods (like BLEU and ROUGE) inadequately capture the nuances of human judgement due to their sensitivity to lexical variations. To address this, SEScore generates synthetic "reference, candidate, score" triples using a stratified error synthesis mechanism, which applies plausible errors of varying severity to raw text. This stratified process not only ensures diversity in error types but also mimics human-perceived error severity through entailment-based severity scoring.
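The point about lexical sensitivity can be illustrated with a toy overlap score. The sketch below is not BLEU or ROUGE themselves, only a clipped unigram precision written for illustration; the function name and the example sentences are invented here and do not come from the paper.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: share of candidate tokens also found in the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matched / total


reference = "the cat sat on the mat"
paraphrase = "a feline rested on the rug"   # similar meaning, different words
wrong_meaning = "the cat sat on the hat"    # near-identical words, different meaning

print(unigram_precision(paraphrase, reference))
print(unigram_precision(wrong_meaning, reference))
```

In this toy case the paraphrase scores lower than the sentence with the wrong meaning, which is the kind of mismatch with human judgement that motivates a learned, severity-aware metric.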

The stratified error synthesis comprises four operations (insertion, deletion, substitution, and swap), each designed to simulate error types such as omission, mistranslation, or grammatical inaccuracy. The severity scoring step then assigns a numerical label to each simulated error, reflecting its impact on perceived sentence quality. The resulting scored examples are used to pretrain a quality prediction model.
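A minimal sketch of how such a pipeline might be wired together is shown below. It is not the authors' implementation: the paper scores severity with an entailment model, whereas `toy_severity` here is a crude hypothetical stand-in, and the penalty values, the `TOY_VOCAB` word list, and all function names are assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical filler words; a real pipeline would generate plausible replacement text.
TOY_VOCAB = ["dog", "quickly", "blue", "never", "house"]


def insert(tokens: List[str], rng: random.Random) -> List[str]:
    i = rng.randrange(len(tokens) + 1)
    return tokens[:i] + [rng.choice(TOY_VOCAB)] + tokens[i:]


def delete(tokens: List[str], rng: random.Random) -> List[str]:
    if len(tokens) < 2:  # never delete the last remaining token
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]


def substitute(tokens: List[str], rng: random.Random) -> List[str]:
    if not tokens:
        return []
    i = rng.randrange(len(tokens))
    return tokens[:i] + [rng.choice(TOY_VOCAB)] + tokens[i + 1:]


def swap(tokens: List[str], rng: random.Random) -> List[str]:
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]  # exchange two adjacent tokens
    return out


OPERATIONS = [insert, delete, substitute, swap]


def toy_severity(before: List[str], after: List[str]) -> float:
    """Assumed stand-in for entailment-based scoring; the penalty values are illustrative."""
    if before == after:
        return 0.0   # the edit happened to change nothing
    if sorted(before) == sorted(after):
        return -1.0  # only word order changed: treated as minor
    return -5.0      # words added, removed, or replaced: treated as severe


def synthesize(
    reference: str,
    num_errors: int,
    severity_fn: Callable[[List[str], List[str]], float] = toy_severity,
    seed: int = 0,
) -> Tuple[str, str, float]:
    """Apply num_errors random edits in sequence and sum their severity penalties."""
    rng = random.Random(seed)
    tokens = reference.split()
    score = 0.0
    for _ in range(num_errors):
        edited = rng.choice(OPERATIONS)(tokens, rng)
        score += severity_fn(tokens, edited)
        tokens = edited
    return reference, " ".join(tokens), score


print(synthesize("the cat sat on the mat", num_errors=3, seed=1))
```

Each call yields one "reference, candidate, score" triple. Varying `num_errors` produces candidates at different quality levels, which is the sense in which the synthesis is stratified.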

Experimental Validation

The paper validates SEScore on multiple NLG tasks including machine translation (WMT 2020/2021), data-to-text (WebNLG), and image captioning (COCO). In machine translation, SEScore surpasses unsupervised metrics (BERTScore and PRISM) and approaches the performance of supervised metrics like COMET, despite not using human-annotated training data. For instance, on WMT 20/21 En-De and Zh-En, SEScore improves the average Kendall correlation with human judgement from 0.154 to 0.195.
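For readers unfamiliar with how such a correlation is obtained, the sketch below computes a plain Kendall tau (the tau-a form, with ties counted as neither concordant nor discordant) between metric scores and human ratings. The exact Kendall variant used in the paper's WMT evaluation may differ, and the scores in the usage lines are invented toy values, not results from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        product = (m1 - m2) * (h1 - h2)
        if product > 0:
            concordant += 1   # metric and humans rank the pair the same way
        elif product < 0:
            discordant += 1   # metric and humans disagree on the pair
    return (concordant - discordant) / (n * (n - 1) // 2)


# Toy values for illustration only.
metric = [0.9, 0.4, 0.7, 0.1]
human = [5.0, 3.0, 2.0, 1.0]
print(kendall_tau(metric, human))
```

A value of 1 means the metric orders every pair of outputs the way the human raters do, and a value of -1 means it orders every pair the opposite way.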

Implications and Future Directions

This work holds significant implications for the field of NLG evaluation. Practically, it offers a scalable and domain-agnostic method to generate training data for evaluation metrics, potentially reducing costs and time associated with human annotation. Theoretically, it provides evidence for the efficacy of stratified synthetic data generation in capturing error severity, suggesting that future work could explore extensions of this method to other areas of language processing, such as dialogue systems or summarization tasks.

Furthermore, the stratified error synthesis framework opens avenues for research into more refined error categorization and severity assessment models, leveraging advanced techniques in entailment and semantic similarity. Future developments in this area may focus on extending the robustness of SEScore across languages with scarce resources and further improving its alignment with varied human judgement scales in diverse applications.

In conclusion, the SEScore framework underscores the importance of nuanced error modeling in automatic evaluation metrics, challenging the community to rethink traditional reliance on human data and pushing the boundaries towards more autonomous and scalable evaluation systems in the AI landscape.
