Why We Need New Evaluation Metrics for NLG (1707.06875v1)

Published 21 Jul 2017 in cs.CL

Abstract: The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

Citations (434)

Summary

  • The paper demonstrates that current automatic metrics (e.g., BLEU, ROUGE) often fail to capture sentence-level nuances compared to human evaluations.
  • It examines discrepancies across datasets and NLG systems, highlighting issues like scale mismatch and system-specific variability in metric performance.
  • The study calls for developing new, context-aware evaluation methods, including reference-less and discriminative models for more reliable assessments.

Necessity for New Evaluation Metrics in NLG

Recent advancements in Natural Language Generation (NLG) have led to the widespread adoption of automatic metrics for evaluating system performance, primarily due to their cost-effectiveness and rapid processing capabilities. However, this paper seeks to underscore the inadequacies of current metrics by analyzing their correlation with human judgments, thereby advocating for new evaluation methodologies.

Examination of Current Metrics

The paper focuses on Word-Based Metrics (WBMs) and Grammar-Based Metrics (GBMs), scrutinizing how well they reflect human evaluations of end-to-end, data-driven NLG systems. Evaluated metrics include, but are not limited to, BLEU, ROUGE, METEOR, and SMATCH, alongside grammar-based indicators such as Flesch Reading Ease and Stanford Parser scores. The analysis spans multiple datasets and domains to ensure comprehensive insights.

Figure 1: Spearman correlations between human ratings and automatic metrics. Blue circles indicate positive and red circles negative correlations; circle size denotes correlation strength.
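
As a concrete illustration of how a word-based metric such as BLEU is computed for a single NLG output, the sketch below uses NLTK's sentence-level BLEU with smoothing against multiple human references. The hypothesis and reference strings are invented for illustration and are not taken from the paper's datasets.

```python
# Minimal sketch: sentence-level BLEU for one NLG output against human references.
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical system output and crowd-sourced references (not from the paper's data).
hypothesis = "the phoenix is a cheap riverside restaurant".split()
references = [
    "the phoenix is an inexpensive restaurant by the river".split(),
    "the phoenix is a low cost restaurant located at the riverside".split(),
]

# Smoothing avoids zero scores when higher-order n-grams fail to match,
# which is common for short, single-sentence NLG outputs.
smooth = SmoothingFunction().method1
bleu = sentence_bleu(references, hypothesis, smoothing_function=smooth)
print(f"Sentence-level BLEU: {bleu:.3f}")
```

Per-output scores like this are what get correlated with per-output human ratings in the analyses discussed below.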

Data-Driven NLG Systems

Three distinct NLG systems are evaluated: RNNLG, TGEN, and JLOLS, each taking a different approach to sentence planning and surface realization. The systems are tested across several datasets, and their outputs are compared against human references for informativeness, naturalness, and quality.

Figure 2: Williams test results showing the significance of differences between the metrics' correlations with human ratings.
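
The Williams test used in Figure 2 checks whether two correlations that share one variable (two metrics, each correlated with the same human ratings) differ significantly. Below is a minimal sketch of the standard Hotelling-Williams t statistic; the correlation values and sample size plugged in are invented for illustration.

```python
# Minimal sketch of the Williams (Hotelling-Williams) test for comparing two
# dependent correlations that share one variable, e.g. two metrics correlated
# with the same human ratings. Verify against a statistics reference before use.
import numpy as np
from scipy import stats

def williams_test(r12, r13, r23, n):
    """r12: corr(metric A, human), r13: corr(metric B, human),
    r23: corr(metric A, metric B), n: number of paired observations."""
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    t = (r12 - r13) * np.sqrt((n - 1) * (1 + r23)) / np.sqrt(
        2 * k * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3
    )
    p = 2 * stats.t.sf(abs(t), df=n - 3)  # two-sided p-value
    return t, p

# Hypothetical values: metric A vs human = 0.35, metric B vs human = 0.25,
# metric A vs metric B = 0.80, over 500 sentence-level judgements.
t_stat, p_value = williams_test(0.35, 0.25, 0.80, 500)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```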

Observations on Metric Performance

The results reveal significant discrepancies between automatic metric scores and human judgments. While automatic scores may align with human ratings at the system level, they often fail to capture nuances at the sentence level. This misalignment is particularly pronounced for outputs with middle-range human ratings, where metrics poorly reflect perceived quality. Furthermore, metric performance is observed to be highly data- and system-specific, suggesting that current metrics lack robustness across different contexts.
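
To make the sentence-level versus system-level distinction concrete, the sketch below computes Spearman correlations both over individual outputs and over per-system averages. The ratings and scores are entirely synthetic (not the paper's data); the point is only that aggregating to system level typically inflates agreement.

```python
# Sketch: sentence-level vs system-level agreement between a metric and human ratings.
# All numbers are synthetic; the paper's data and results are not reproduced here.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Three hypothetical systems with different latent quality levels.
systems = {"sysA": 0.6, "sysB": 0.5, "sysC": 0.4}
rows = []
for name, quality in systems.items():
    human = np.clip(rng.normal(quality * 6, 1.2, size=200), 1, 6)    # 1-6 ratings
    metric = np.clip(quality + rng.normal(0, 0.25, size=200), 0, 1)  # 0-1 scores
    rows.append((name, human, metric))

# Sentence-level correlation: pool all individual outputs.
rho_sent, _ = spearmanr(np.concatenate([h for _, h, _ in rows]),
                        np.concatenate([m for _, _, m in rows]))

# System-level correlation: one averaged point per system.
rho_sys, _ = spearmanr([h.mean() for _, h, _ in rows],
                       [m.mean() for _, _, m in rows])

print(f"sentence-level Spearman rho = {rho_sent:.2f}")
print(f"system-level   Spearman rho = {rho_sys:.2f}")
```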

Human and Automatic Metric Correlation

The paper provides a detailed correlation analysis between human evaluations and the various metrics. None of the metrics achieves even a moderate correlation across the board, and agreement is stronger for outputs rated clearly good or clearly bad than for those rated average.

Figure 3: Correlation between automatic metrics (WBMs) and human ratings, differentiated by informativeness levels.
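
The kind of breakdown shown in Figure 3 can be obtained by computing correlations separately within rating bands. The sketch below illustrates the procedure on synthetic data only; it does not reproduce the paper's findings.

```python
# Sketch: Spearman correlation between a metric and human ratings within rating bands.
# Synthetic data only; illustrates the banded analysis, not the paper's results.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
human = rng.integers(1, 7, size=1000).astype(float)            # 1-6 Likert ratings
metric = np.clip(human / 6 + rng.normal(0, 0.2, 1000), 0, 1)   # noisy metric score

bands = {"low (1-2)": (1, 2), "mid (3-4)": (3, 4), "high (5-6)": (5, 6)}
for name, (lo, hi) in bands.items():
    mask = (human >= lo) & (human <= hi)
    rho, _ = spearmanr(human[mask], metric[mask])
    print(f"{name}: n={mask.sum():4d}, Spearman rho = {rho:.2f}")
```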

Limitations of Current Approaches

The paper identifies several limitations inherent in current evaluation practices:

  • Assumption of Gold Standards: Current metrics presuppose that human references are correct and complete, which frequently does not hold, particularly in crowdsourced datasets whose references may be ungrammatical or incomplete and thus distort scores.
  • Scale Mismatch: Metric outputs are continuous scores, whereas human judgments are discrete ordinal ratings, which complicates direct comparison and alignment.
  • System-Specific Variability: The dependency of metrics on specific system architectures and datasets dilutes their reliability across broader applications.

Conclusions and Future Directions

The authors conclude that state-of-the-art automatic evaluation metrics inadequately reflect human evaluations, underscoring the necessity for human assessments in the development of NLG systems. Future directions include the exploration of reference-less evaluations and discriminative models, alongside enhancements to existing metrics to improve cross-domain performance.
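
As a rough illustration of the "discriminative model" direction, the toy sketch below trains a reference-less classifier that flags outputs likely to receive low human ratings using a few simple surface features. The features, training examples, and model choice are invented stand-ins, not the approach proposed in the paper.

```python
# Toy sketch of a reference-less quality estimator: a discriminative model that
# predicts whether an output would receive a low human rating, using only crude
# surface features of the output and its meaning representation (MR).
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(mr_slots, output):
    tokens = output.split()
    return [
        len(tokens),                      # output length
        mr_slots,                         # number of slots in the input MR
        len(tokens) / max(mr_slots, 1),   # tokens per slot (crude coverage proxy)
    ]

# Hypothetical training data: (MR slot count, output text, received a low rating?)
train = [
    (3, "the phoenix is a cheap riverside restaurant", 0),
    (3, "the phoenix riverside", 1),
    (4, "the mill is a family friendly pub near the river serving fast food", 0),
    (4, "the mill pub pub pub", 1),
]
X = np.array([features(n, text) for n, text, _ in train])
y = np.array([label for _, _, label in train])

clf = LogisticRegression().fit(X, y)
print(clf.predict([features(3, "the phoenix serves cheap food by the river")]))
```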

In advancing the field, the integration of advanced contextual assessments and extrinsic evaluation methods offers promising avenues for developing more reliable, system-independent evaluation metrics for NLG technologies.