- The paper reveals that BLEU inadequately scores structural text simplification, particularly penalizing effective sentence splitting.
- The study introduces the HSplit corpus to benchmark BLEU against human judgments and alternative metrics like SARI and FK.
- Experimental results show that BLEU correlates negatively with output simplicity, and only weakly or negatively with grammaticality and meaning preservation, when sentence splitting is involved.
BLEU is Not Suitable for the Evaluation of Text Simplification
Introduction
The paper argues that BLEU, a prominent evaluation metric for text-to-text generation tasks including machine translation (MT), is inadequate for evaluating text simplification (TS), particularly when it involves structural operations such as sentence splitting. Despite its widespread adoption, BLEU's reliance on n-gram overlap misaligns it with human judgments of simplicity, grammaticality, and meaning preservation in sentence splitting settings. As a result, BLEU often correlates inversely with the simplicity of the output, effectively penalizing systems that produce simplified forms.
Critique of BLEU in Text Simplification
Text simplification encompasses both lexical and structural operations. BLEU's usefulness for TS is limited because its n-gram-based scoring inherently favors lexical similarity. The paper highlights prior findings [Xu16] showing that BLEU tends to assign higher scores to outputs that closely resemble the original inputs, and so fails to reward successful simplifications that deviate structurally, such as through sentence splitting.
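To make the overlap issue concrete, the following is a minimal sketch (not taken from the paper) using NLTK's smoothed sentence-level BLEU on invented sentences: a faithful split of the input scores lower than a verbatim copy, even though only the former is a simplification.

```python
# Minimal sketch (not from the paper): NLTK's smoothed sentence-level BLEU
# on invented sentences, showing that a faithful split of the input scores
# lower than a verbatim copy, even though only the former simplifies.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "he visited paris , and he bought a gift for his sister .".split()
# A faithful split-and-rephrase output: same content, two sentences.
split_output = "he visited paris . he bought a gift for his sister .".split()
# An output that copies the reference verbatim (no simplification at all).
copy_output = "he visited paris , and he bought a gift for his sister .".split()

smooth = SmoothingFunction().method1  # avoid zero scores on short segments
print(sentence_bleu([reference], split_output, smoothing_function=smooth))  # ~0.68
print(sentence_bleu([reference], copy_output, smoothing_function=smooth))   # 1.0
# The unsimplified copy scores higher: BLEU rewards surface overlap.
```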
The paper introduces HSplit, a human-produced gold-standard corpus focused on sentence splitting, to test BLEU's applicability by contrasting BLEU scores with human evaluations. BLEU exhibited low or negative correlation with grammaticality and meaning preservation when sentence splitting occurred, and a clearly negative correlation with simplicity, supporting the claim that BLEU, as currently formulated, does not handle structural TS adequately.
Analysis Through HSplit Corpus
The HSplit corpus, created for this paper, provides the basis for the analysis of BLEU's deficiencies in TS evaluation. By collecting manual annotations from native English speakers, the authors ensured that the corpus reflects human expectations of sentence-split outputs. Two sets of guidelines directed the corpus construction: one asking annotators to split whenever it could be done grammatically, and another asking them to split only where it simplifies the sentence. The results show substantial BLEU score drops even for these human-produced outputs, reinforcing BLEU's inability to appraise structural TS modifications.
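As a rough illustration of the multi-reference setting that a corpus like HSplit enables, here is a hedged sketch using sacrebleu's corpus-level BLEU with several reference streams; the sentences are invented and do not come from the corpus.

```python
# Hedged sketch of a multi-reference evaluation: corpus-level BLEU against
# several human split references via sacrebleu. Sentences are invented.
import sacrebleu

system_outputs = ["the festival is held in july . it attracts many visitors ."]

# Each reference stream is aligned with the system outputs (one per line).
refs_annotator_a = ["the festival is held in july . many visitors attend it ."]
refs_annotator_b = ["the festival takes place in july , attracting many visitors ."]

bleu = sacrebleu.corpus_bleu(system_outputs, [refs_annotator_a, refs_annotator_b])
print(bleu.score)  # corpus BLEU (0-100) against the split references
```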
Experimental Metrics and Human Correlation
Alternative metrics such as iBLEU, Flesch-Kincaid (FK), and SARI were assessed alongside BLEU. Notably, the Levenshtein distance to the source (LDSC) correlated more consistently with human judgments of meaning preservation and grammaticality. This comparison highlights BLEU's shortfall as a TS metric and points to more robust alternatives in structural settings.
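Two of the signals above are easy to approximate in isolation; the sketch below computes a rough Flesch-Kincaid grade (with a naive vowel-group syllable heuristic) and a character-level Levenshtein distance to the source. Both are only illustrative stand-ins, not the tooling the authors actually used.

```python
# Illustrative stand-ins (not the authors' implementations) for two of the
# signals above: a rough Flesch-Kincaid grade and the character-level
# Levenshtein distance between a system output and its source sentence.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels as syllables.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade-level formula.
    sentences = max(1, len(re.findall(r"[.!?]", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

source = "The festival, which attracts thousands of visitors, is held annually in July."
output = "The festival is held every July. It attracts thousands of visitors."
print(fk_grade(output))             # lower grade suggests simpler text
print(levenshtein(source, output))  # larger distance means more editing of the source
```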
In both the standard and the split-reference evaluation settings, BLEU systematically under-scores systems that explicitly target structural simplification, further confirming its unsuitability for TS evaluation, particularly in settings built around sentence transformation such as Split-and-Rephrase.
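The correlation analyses themselves follow a standard pattern; the following is a hedged sketch using SciPy's Spearman correlation on made-up per-system scores, not the paper's data.

```python
# Hedged sketch of the style of correlation analysis described above:
# Spearman's rho between a metric and mean human ratings per system.
# The numbers are made-up placeholders, not results from the paper.
from scipy.stats import spearmanr

bleu_scores      = [72.1, 65.4, 58.9, 80.3, 61.0]  # per-system metric values
human_simplicity = [2.1, 3.4, 3.8, 1.9, 3.5]       # mean human ratings (e.g., 1-5)

rho, p_value = spearmanr(bleu_scores, human_simplicity)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A negative rho would mirror the finding that higher BLEU can coincide
# with lower perceived simplicity when outputs split sentences.
```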
Conclusion
The findings show that BLEU misjudges TS systems that prioritize structural change over lexical fidelity, making it an unreliable measure of TS performance. The paper calls for refined metrics, or the improvement of existing ones such as SARI and FK, that better capture the structural goals intrinsic to TS, and it advocates integrating human-informed judgments into automated evaluation frameworks. Exploring such alternatives will be pivotal for advancing TS research, especially work focused on structural simplification.
The research encourages the community to consider new evaluation paradigms for TS that capture the diverse qualities of simplified output without over-relying on traditional lexical-overlap metrics, signaling the need for standard evaluation practice to evolve.