- The paper introduces a learning-based approach for single-document summarization that incorporates sentence compression and anaphoricity constraints for improved coherence.
- A discriminative model with sparse lexical features, trained end-to-end, jointly performs content selection and syntactic and discourse-based sentence compression, with coreference-driven anaphoricity constraints enforced via an ILP formulation.
- The full system achieved higher ROUGE scores and better human judgments of linguistic quality than prior methods on the New York Times Annotated Corpus.
Overview of "Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints"
This paper presents an approach to single-document summarization that targets both content compression and coherence, the latter through anaphoricity constraints. Authored by Durrett, Berg-Kirkpatrick, and Klein, the work employs a discriminative model that learns to select important textual content based on rich sparse lexical features.
Single-document summarization lacks the redundancy signals that multi-document summarization can exploit, making content selection particularly challenging. To address this, the authors trained their model on the New York Times Annotated Corpus, which provides roughly 100,000 news articles paired with abstractive summaries.
Model Composition and Constraints
The model is discriminative, relying on sparse features whose weights are learned end-to-end on the corpus. Central to the system is compression within sentences: sub-sentential textual units can be deleted, subject to requirement relations between units. The model's expressiveness comes from combining two formal frameworks for sentence compression, one syntactic and one discursive. A toy sketch of this style of feature-based scoring appears below.
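To make the scoring concrete, here is a minimal sketch assuming a sparse feature representation; the feature names and the helpers (`score_unit`, `score_summary`) are illustrative, not the paper's actual implementation:

```python
from collections import defaultdict

# Learned feature weights; defaultdict returns 0.0 for unseen features.
# These particular feature names and values are invented for illustration.
weights = defaultdict(float, {"position=lead": 1.2, "unit_contains=said": -0.4})

def score_unit(features):
    """Dot product of a sparse feature map (name -> value) with the weights."""
    return sum(weights[name] * value for name, value in features.items())

def score_summary(unit_features, selected):
    """A candidate summary's score sums over the textual units it keeps."""
    return sum(score_unit(unit_features[i]) for i in selected)

# Toy usage: two candidate textual units; keeping only the first scores 0.8.
units = [{"position=lead": 1.0, "unit_contains=said": 1.0},
         {"position=body": 1.0}]
print(score_summary(units, selected=[0]))  # 1.2 - 0.4 = 0.8
```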
To ensure coherence, particularly across sentences, anaphoricity constraints form a critical part of the model. These constraints guarantee that a pronoun's antecedent is preserved within the summary or, alternatively, that the pronoun is rewritten as a full entity mention. Selection, compression, and these constraints are solved jointly through an integer linear programming (ILP) formulation, which balances the constraints against effective content extraction; a toy version of such an ILP is sketched below.
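The following is a minimal sketch of how an anaphoricity constraint can be encoded in an ILP, using the PuLP solver. The unit lengths, scores, rewrite cost, and variable layout are invented for illustration and are not the paper's exact formulation:

```python
import pulp

# x[i] = 1 if textual unit i is kept; r[p] = 1 if pronoun p is rewritten
# as a full entity mention. All numbers below are toy values.
unit_len = [12, 9, 15, 7]          # token lengths of candidate units
scores = [2.0, 1.1, 1.7, 0.4]      # model scores for each unit
budget = 30                        # summary length limit in tokens
# One pronoun, appearing in unit 2, whose antecedent appears in unit 0.
pronouns = [{"unit": 2, "antecedent_units": [0], "rewrite_cost": 0.5}]

prob = pulp.LpProblem("summarize", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(unit_len))]
r = [pulp.LpVariable(f"r{p}", cat="Binary") for p in range(len(pronouns))]

# Objective: total score of kept units, minus a penalty for each rewrite.
prob += (pulp.lpSum(scores[i] * x[i] for i in range(len(unit_len)))
         - pulp.lpSum(pr["rewrite_cost"] * r[p] for p, pr in enumerate(pronouns)))
# Length budget on the summary.
prob += pulp.lpSum(unit_len[i] * x[i] for i in range(len(unit_len))) <= budget
# Anaphoricity: if a pronoun's unit is kept, either keep a unit containing
# its antecedent or rewrite the pronoun.
for p, pr in enumerate(pronouns):
    prob += x[pr["unit"]] <= pulp.lpSum(x[a] for a in pr["antecedent_units"]) + r[p]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(v.value()) for v in x], [int(v.value()) for v in r])
# -> [1, 0, 1, 0] [0]: units 0 and 2 are kept, so no rewrite is needed.
```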
Experimental Methodology and Evaluation
The authors evaluated their model against prior methods on two fronts: ROUGE scores and human judgments of linguistic quality. Their full system outperformed existing methods, including a document-prefix baseline and a discourse-informed method, on both measures. The snippet below illustrates how ROUGE comparisons of this kind are typically computed.
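As a small illustration of the automatic metric, here is ROUGE scoring with Google's `rouge_score` package; the reference and system texts are toy examples, not drawn from the NYT corpus or the paper:

```python
from rouge_score import rouge_scorer

# Compare a system summary against a reference with ROUGE-1 and ROUGE-2.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
reference = "the senate passed the budget bill on friday"
system = "the senate passed the bill friday"
for name, s in scorer.score(reference, system).items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F={s.fmeasure:.2f}")
```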
This work highlights the value of structural constraints derived from coreference and syntactic parses for maintaining fluency while extracting meaningful content. By learning feature weights end-to-end, the system reduces the reliance on hand-tuned heuristics common in prior work.
Implications and Future Directions
This paper contributes to both the practical and theoretical understanding of single-document summarization. Practically, its model offers a framework that scales to large datasets, with applications in domains such as journalism and academic literature. Theoretically, the work underscores that anaphora should be resolved not merely as a preprocessing step but as an integral part of the summarization process.
Looking forward, future work could refine the coreference mechanisms and compression rules to optimize content retention and coherence jointly. Expanding beyond news articles to more diverse document genres would also test the model's adaptability and robustness.
In essence, this paper delivers a substantive advance in summarization, showing that integrating syntactic and discourse-based compression with anaphora resolution improves single-document summary quality, and it opens promising avenues for future work in NLP.