- The paper integrates word embeddings into ROUGE to assess semantic similarity, overcoming the limitations of strict lexical matching.
- It introduces ROUGE-WE variants built on word2vec embeddings, with a multiplicative compositional scheme for n-grams that have no entry in the embedding vocabulary.
- Experimental findings on the AESOP dataset show improved correlation with human judgments, as measured by Pearson, Spearman, and Kendall coefficients.
Integrating Word Embeddings into ROUGE for Improved Summarization Evaluation
Introduction
The paper "Better Summarization Evaluation with Word Embeddings for ROUGE" (1508.06034) addresses the inherent bias in the ROUGE metric towards lexical similarity, which detracts from its effectiveness in evaluating abstractive summaries. Standard ROUGE relies on n-gram overlap and fails to capture semantic similarities between expressions. This research proposes integrating word embeddings to enhance ROUGE by considering semantic content, thereby improving its correlation with human judgment.
Methodology
The core contribution of the paper is the adaptation of ROUGE to incorporate semantic similarity through word embeddings, specifically word2vec. The resulting metric, ROUGE-WE, replaces exact matching with a similarity function f_WE that scores candidate and reference n-grams using pre-trained word vectors. This yields ROUGE-WE-1, ROUGE-WE-2, and ROUGE-WE-SU4, analogous to the corresponding ROUGE variants. Since word2vec supplies vectors only for individual words, bigrams and skip-bigrams are represented compositionally, by multiplicatively combining the vectors of their constituent words; terms missing from the embedding vocabulary contribute a similarity of zero.
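The sketch below illustrates how such a similarity function and the multiplicative n-gram composition could be implemented; the toy embedding table, the vector values, and the cosine formulation are stand-ins for the pre-trained word2vec vectors and the exact scoring used in the paper.

```python
import numpy as np

# Toy embedding table standing in for pre-trained word2vec vectors.
EMBEDDINGS = {
    "rain":     np.array([0.9, 0.1, 0.3]),
    "downpour": np.array([0.8, 0.2, 0.4]),
    "heavy":    np.array([0.1, 0.9, 0.2]),
    "severe":   np.array([0.2, 0.8, 0.3]),
}

def embed_ngram(ngram):
    """Compose an n-gram vector by element-wise multiplication; None if any word is missing."""
    vectors = [EMBEDDINGS.get(w) for w in ngram]
    if any(v is None for v in vectors):
        return None
    composed = vectors[0].copy()
    for v in vectors[1:]:
        composed = composed * v
    return composed

def f_we(ngram_a, ngram_b):
    """Cosine similarity of composed n-gram vectors; 0.0 when either side is out of vocabulary."""
    a, b = embed_ngram(ngram_a), embed_ngram(ngram_b)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f_we(("heavy", "rain"), ("severe", "downpour")))  # high, despite zero lexical overlap
print(f_we(("heavy", "rain"), ("heavy", "hail")))       # 0.0: "hail" is missing from the toy table
```

Replacing ROUGE's exact n-gram match with such a soft similarity is what lets ROUGE-WE reward paraphrases that conventional ROUGE would ignore.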
Experimentation and Results
Experiments were conducted on the AESOP dataset from the TAC 2011 summarization task, with agreement to human assessments measured by Pearson, Spearman, and Kendall rank correlation coefficients. The results show that ROUGE-WE-1 improves the correlation between automatic and human evaluations, with superior Spearman and Kendall coefficients compared to traditional ROUGE; for instance, ROUGE-WE-1 outperformed the other metrics in Spearman rank correlation with pyramid scores. ROUGE-WE-2 also showed improvements, while ROUGE-WE-SU4 lagged behind, potentially because the multiplicative composition is less suited to skip-bigrams, whose words need not be adjacent.
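For reference, the three correlation measures can be computed with scipy as sketched below; the per-system scores are hypothetical placeholders, not AESOP data.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

rouge_we_1_scores = [0.42, 0.35, 0.51, 0.47, 0.30]  # hypothetical metric scores per system
pyramid_scores    = [0.60, 0.48, 0.72, 0.66, 0.41]  # hypothetical human pyramid scores

print("Pearson:  %.3f" % pearsonr(rouge_we_1_scores, pyramid_scores)[0])
print("Spearman: %.3f" % spearmanr(rouge_we_1_scores, pyramid_scores)[0])
print("Kendall:  %.3f" % kendalltau(rouge_we_1_scores, pyramid_scores)[0])
```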
Comparison with Previous Work
When placed alongside other evaluation systems from AESOP 2011, ROUGE-WE-1 emerged as a frontrunner in correlation with human judgment, surpassing systems such as C_S_IIITH3 and BE-HM in Spearman correlation with pyramid scores. This underscores the paper's success in extending ROUGE to capture similarity beyond exact lexical matching.
Future Work and Implications
Looking ahead, the authors propose extending the evaluation to summarization systems that paraphrase more heavily, where the benefit of semantic matching should be most visible. They also suggest refining the embedding composition model with more advanced techniques, particularly for bigrams and skip-bigrams. A more semantically aware evaluation metric such as ROUGE-WE could in turn support the development of more sophisticated summarization systems.
Conclusion
The paper effectively demonstrates that integrating word embeddings into ROUGE yields a metric that aligns more closely with human evaluations by overcoming ROUGE's lexical bias. ROUGE-WE represents a significant advancement in automatic summarization evaluation, and with further refinement the approach could serve broader applications and ultimately foster richer, more abstractive text summarization methodologies.