Why We Need New Evaluation Metrics for NLG (1707.06875v1)

Published 21 Jul 2017 in cs.CL

Abstract: The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

Citations (434)

Summary

  • The paper demonstrates that current automatic metrics (e.g., BLEU, ROUGE) often fail to capture sentence-level nuances compared to human evaluations.
  • It examines discrepancies across datasets and NLG systems, highlighting issues like scale mismatch and system-specific variability in metric performance.
  • The study calls for developing new, context-aware evaluation methods, including reference-less and discriminative models for more reliable assessments.

Necessity for New Evaluation Metrics in NLG

Recent advancements in Natural Language Generation (NLG) have led to the widespread adoption of automatic metrics for evaluating system performance, primarily due to their cost-effectiveness and rapid processing capabilities. However, this paper seeks to underscore the inadequacies of current metrics by analyzing their correlation with human judgments, thereby advocating for new evaluation methodologies.

Examination of Current Metrics

The paper focuses on Word-Based Metrics (WBMs) and Grammar-Based Metrics (GBMs), scrutinizing how well they reflect human evaluations of end-to-end, data-driven NLG systems. Evaluated metrics include, but are not limited to, BLEU, ROUGE, METEOR, and SMATCH, alongside grammar-based indicators such as Flesch Reading Ease and Stanford Parser scores. The analysis spans multiple datasets and domains to ensure comprehensive insights.

Figure 1: Spearman correlations between human ratings and automatic metrics. Blue circles indicate positive and red circles negative correlations; circle size denotes correlation strength.
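
As a concrete illustration of how a word-based metric such as BLEU is computed for a single NLG output, the sketch below uses NLTK's sentence-level BLEU with smoothing against multiple human references. The hypothesis and reference strings are invented for illustration and are not taken from the paper's datasets.

```python
# Minimal sketch: sentence-level BLEU for one NLG output against human references.
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical system output and crowd-sourced references (not from the paper's data).
hypothesis = "the phoenix is a cheap riverside restaurant".split()
references = [
    "the phoenix is an inexpensive restaurant by the river".split(),
    "the phoenix is a low cost restaurant located at the riverside".split(),
]

# Smoothing avoids zero scores when higher-order n-grams fail to match,
# which is common for short, single-sentence NLG outputs.
smooth = SmoothingFunction().method1
bleu = sentence_bleu(references, hypothesis, smoothing_function=smooth)
print(f"Sentence-level BLEU: {bleu:.3f}")
```

Per-output scores like this are what get correlated with per-output human ratings in the analyses discussed below.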

Data-Driven NLG Systems

Three distinct NLG systems are evaluated: RNNLG, TGEN, and JLOLS, each taking a different approach to sentence planning and surface realization. The systems are tested across several datasets, and their outputs are compared against human references for informativeness, naturalness, and quality.

Figure 2: Williams test results showing the significance of differences between the metrics' correlations with human ratings.
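
The Williams test used in Figure 2 checks whether two correlations that share one variable (two metrics, each correlated with the same human ratings) differ significantly. Below is a minimal sketch of the standard Hotelling-Williams t statistic; the correlation values and sample size plugged in are invented for illustration.

```python
# Minimal sketch of the Williams (Hotelling-Williams) test for comparing two
# dependent correlations that share one variable, e.g. two metrics correlated
# with the same human ratings. Verify against a statistics reference before use.
import numpy as np
from scipy import stats

def williams_test(r12, r13, r23, n):
    """r12: corr(metric A, human), r13: corr(metric B, human),
    r23: corr(metric A, metric B), n: number of paired observations."""
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    t = (r12 - r13) * np.sqrt((n - 1) * (1 + r23)) / np.sqrt(
        2 * k * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3
    )
    p = 2 * stats.t.sf(abs(t), df=n - 3)  # two-sided p-value
    return t, p

# Hypothetical values: metric A vs human = 0.35, metric B vs human = 0.25,
# metric A vs metric B = 0.80, over 500 sentence-level judgements.
t_stat, p_value = williams_test(0.35, 0.25, 0.80, 500)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```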

Observations on Metric Performance

The results reveal significant discrepancies between automatic metric scores and human judgments. While automatic scores may align with human ratings at the system level, they often fail to capture nuances at the sentence level. This misalignment is particularly pronounced for outputs with middle-range human ratings, where metrics poorly reflect perceived quality. Furthermore, metric performance is observed to be highly data- and system-specific, suggesting that current metrics lack robustness across different contexts.
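
To make the sentence-level versus system-level distinction concrete, the sketch below computes Spearman correlations both over individual outputs and over per-system averages. The ratings and scores are entirely synthetic (not the paper's data); the point is only that aggregating to system level typically inflates agreement.

```python
# Sketch: sentence-level vs system-level agreement between a metric and human ratings.
# All numbers are synthetic; the paper's data and results are not reproduced here.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Three hypothetical systems with different latent quality levels.
systems = {"sysA": 0.6, "sysB": 0.5, "sysC": 0.4}
rows = []
for name, quality in systems.items():
    human = np.clip(rng.normal(quality * 6, 1.2, size=200), 1, 6)    # 1-6 ratings
    metric = np.clip(quality + rng.normal(0, 0.25, size=200), 0, 1)  # 0-1 scores
    rows.append((name, human, metric))

# Sentence-level correlation: pool all individual outputs.
rho_sent, _ = spearmanr(np.concatenate([h for _, h, _ in rows]),
                        np.concatenate([m for _, _, m in rows]))

# System-level correlation: one averaged point per system.
rho_sys, _ = spearmanr([h.mean() for _, h, _ in rows],
                       [m.mean() for _, _, m in rows])

print(f"sentence-level Spearman rho = {rho_sent:.2f}")
print(f"system-level   Spearman rho = {rho_sys:.2f}")
```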

Human and Automatic Metric Correlation

The paper provides a detailed correlation analysis between human evaluations and the various metrics. None of the metrics achieves even a moderate correlation across the board, and agreement is stronger for outputs rated clearly good or clearly bad than for those rated average.

Figure 3: Correlation between automatic metrics (WBMs) and human ratings, differentiated by informativeness levels.
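
The kind of breakdown shown in Figure 3 can be obtained by computing correlations separately within rating bands. The sketch below illustrates the procedure on synthetic data only; it does not reproduce the paper's findings.

```python
# Sketch: Spearman correlation between a metric and human ratings within rating bands.
# Synthetic data only; illustrates the banded analysis, not the paper's results.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
human = rng.integers(1, 7, size=1000).astype(float)            # 1-6 Likert ratings
metric = np.clip(human / 6 + rng.normal(0, 0.2, 1000), 0, 1)   # noisy metric score

bands = {"low (1-2)": (1, 2), "mid (3-4)": (3, 4), "high (5-6)": (5, 6)}
for name, (lo, hi) in bands.items():
    mask = (human >= lo) & (human <= hi)
    rho, _ = spearmanr(human[mask], metric[mask])
    print(f"{name}: n={mask.sum():4d}, Spearman rho = {rho:.2f}")
```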

Limitations of Current Approaches

The paper identifies several limitations inherent in current evaluation practices:

  • Assumption of Gold Standards: Current metrics presuppose that human references are correct and complete, which frequently does not hold, particularly in crowdsourced datasets whose references may be ungrammatical or incomplete and thus distort scores.
  • Scale Mismatch: Metric outputs are continuous scores, whereas human judgments are discrete ordinal ratings, which complicates direct comparison and alignment.
  • System-Specific Variability: The dependency of metrics on specific system architectures and datasets dilutes their reliability across broader applications.

Conclusions and Future Directions

The authors conclude that state-of-the-art automatic evaluation metrics inadequately reflect human evaluations, underscoring the necessity for human assessments in the development of NLG systems. Future directions include the exploration of reference-less evaluations and discriminative models, alongside enhancements to existing metrics to improve cross-domain performance.
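
As a rough illustration of the "discriminative model" direction, the toy sketch below trains a reference-less classifier that flags outputs likely to receive low human ratings using a few simple surface features. The features, training examples, and model choice are invented stand-ins, not the approach proposed in the paper.

```python
# Toy sketch of a reference-less quality estimator: a discriminative model that
# predicts whether an output would receive a low human rating, using only crude
# surface features of the output and its meaning representation (MR).
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(mr_slots, output):
    tokens = output.split()
    return [
        len(tokens),                      # output length
        mr_slots,                         # number of slots in the input MR
        len(tokens) / max(mr_slots, 1),   # tokens per slot (crude coverage proxy)
    ]

# Hypothetical training data: (MR slot count, output text, received a low rating?)
train = [
    (3, "the phoenix is a cheap riverside restaurant", 0),
    (3, "the phoenix riverside", 1),
    (4, "the mill is a family friendly pub near the river serving fast food", 0),
    (4, "the mill pub pub pub", 1),
]
X = np.array([features(n, text) for n, text, _ in train])
y = np.array([label for _, _, label in train])

clf = LogisticRegression().fit(X, y)
print(clf.predict([features(3, "the phoenix serves cheap food by the river")]))
```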

In advancing the field, the integration of advanced contextual assessments and extrinsic evaluation methods offers promising avenues for developing more reliable, system-independent evaluation metrics for NLG technologies.