
BLEU is Not Suitable for the Evaluation of Text Simplification (1810.05995v1)

Published 14 Oct 2018 in cs.CL

Abstract: BLEU is widely considered to be an informative metric for text-to-text generation, including Text Simplification (TS). TS includes both lexical and structural aspects. In this paper we show that BLEU is not suitable for the evaluation of sentence splitting, the major structural simplification operation. We manually compiled a sentence splitting gold standard corpus containing multiple structural paraphrases, and performed a correlation analysis with human judgments. We find low or no correlation between BLEU and the grammaticality and meaning preservation parameters where sentence splitting is involved. Moreover, BLEU often negatively correlates with simplicity, essentially penalizing simpler sentences.

Citations (185)

Summary

  • The paper reveals that BLEU inadequately scores structural text simplification, particularly penalizing effective sentence splitting.
  • The study introduces the HSplit corpus to benchmark BLEU against human judgments and alternative metrics like SARI and FK.
  • Experimental results show BLEU's negative correlation with output simplicity, grammaticality, and meaning preservation in TS tasks.

BLEU is Not Suitable for the Evaluation of Text Simplification

Introduction

The paper argues that BLEU, a prominent evaluation metric for text-to-text generation tasks including machine translation (MT), is inadequate for evaluating text simplification (TS), particularly when it involves structural operations such as sentence splitting. Despite its widespread adoption, BLEU's reliance on n-gram overlap leads to misalignment with human judgments of simplicity, grammaticality, and meaning preservation in sentence splitting tasks. As a result, BLEU often correlates negatively with the desired simplicity of the output, effectively penalizing systems that produce simpler forms.

Critique of BLEU in Text Simplification

Text simplification encompasses both lexical and structural adjustments. BLEU's effectiveness in TS is limited due to its n-gram-centric scoring system that inherently favors lexical similarity. The paper highlights prior findings [Xu16] that revealed BLEU's proclivity for assigning higher scores to outputs closely resembling the original inputs, thus failing to reward successful simplification efforts that deviate structurally, such as in sentence splitting.
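To make this concrete, the sketch below is a minimal illustration, not the paper's evaluation code: the example sentences are invented, and NLTK's sentence-level BLEU stands in for the corpus-level scoring used in the paper. It shows how n-gram overlap rewards an output that copies the reference and penalizes a grammatical sentence split.

```python
# Minimal illustration (not the paper's evaluation code): BLEU's n-gram overlap
# rewards outputs that mirror the reference surface form, so a correct
# sentence split can score lower than leaving the source unchanged.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference = "the cat sat on the mat because it was tired".split()

# Output A: a verbatim copy of the source-like reference (no simplification).
copy_output = "the cat sat on the mat because it was tired".split()
# Output B: a grammatical sentence split that preserves the meaning.
split_output = "the cat sat on the mat . it was tired".split()

for name, hyp in [("copy", copy_output), ("split", split_output)]:
    score = sentence_bleu([reference], hyp, smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")
# The copy scores close to 1.0, while the split output is penalized
# even though it is the better simplification.
```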

The paper introduces HSplit, a human-generated gold standard corpus focused on sentence splitting, to rigorously test BLEU's applicability by contrasting BLEU scores with human evaluation metrics. BLEU exhibited low or negative correlation with grammaticality and meaning preservation when sentence splitting occurred, and notably negative correlation with simplicity, underpinning the contention that BLEU's current framework inadequately addresses structural TS.
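A correlation analysis of this kind can be set up along the following lines. The scores below are hypothetical placeholders rather than the paper's data; the sketch only shows how per-system metric scores can be compared with human ratings using Spearman's rho.

```python
# Sketch of the correlation analysis described above, using invented scores.
from scipy.stats import spearmanr

# Hypothetical per-system scores (not from the paper).
bleu_scores      = [0.92, 0.85, 0.71, 0.60, 0.55]
human_simplicity = [2.1,  2.4,  3.0,  3.6,  3.8]   # e.g. 1-5 Likert ratings

rho, p = spearmanr(bleu_scores, human_simplicity)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A negative rho here would mirror the paper's finding that BLEU can
# anti-correlate with human-judged simplicity.
```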

Analysis Through HSplit Corpus

The HSplit corpus, created for this paper, provides a robust basis for analyzing BLEU's deficiencies in TS evaluation. By collecting manual annotations from native English speakers, the authors ensured that the corpus reflects human expectations of sentence-split outputs. Two sets of guidelines directed the corpus construction: one emphasizing maximal grammatical splitting, and another prioritizing splitting only where it simplifies the sentence. The results show significant BLEU score drops even for these human-produced gold-standard outputs, reinforcing that BLEU cannot effectively appraise structural TS modifications.

Experimental Metrics and Human Correlation

Alternative metrics such as iBLEU, Flesch-Kincaid (FK), and SARI were assessed alongside BLEU. Significantly, the Levenshtein distance to the source (LD_SC) correlated more consistently with human judgments of meaning preservation and grammaticality. This comparison highlights BLEU's shortfall as a metric for TS assessment and illustrates more robust alternatives in specific structural contexts.
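As a rough illustration of the LD_SC signal, the sketch below computes a plain character-level Levenshtein distance between a source sentence and a system output. The helper function and example strings are illustrative assumptions, not the paper's implementation.

```python
# Character-level edit distance between source and output: a larger value
# indicates the output departs more from the source surface form.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

source = "the cat sat on the mat because it was tired"
output = "the cat sat on the mat . it was tired"
print(levenshtein(source, output))
```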

In both standard and split-reference evaluation settings, BLEU consistently underperforms, particularly against systems that explicitly target structural simplification. This further demonstrates its unsuitability for TS evaluation in scenarios built around sentence transformation, such as Split-and-Rephrase tasks.

Conclusion

The findings show that BLEU is an unreliable measure of performance for TS systems that prioritize structural change over lexical fidelity. The paper calls for refined metrics, or enhancements to existing ones such as SARI and FK, that better capture the structural modification goals intrinsic to TS, and advocates integrating human-informed judgments into automated evaluation frameworks. Exploring such alternatives will be pivotal for advancing TS research, particularly work focused on structural simplification strategies.

The research encourages the community to adopt evaluation paradigms for TS that capture diverse output qualities without over-relying on traditional lexical-overlap metrics, arguing that standard evaluation practice needs to evolve.
