Fine-tune BERT for Extractive Summarization

(1903.10318)
Published Mar 25, 2019 in cs.CL

Abstract

BERT, a pre-trained Transformer model, has achieved ground-breaking performance on multiple NLP tasks. In this paper, we describe BERTSUM, a simple variant of BERT, for extractive summarization. Our system is the state of the art on the CNN/Dailymail dataset, outperforming the previous best-performed system by 1.65 on ROUGE-L. The codes to reproduce our results are available at https://github.com/nlpyang/BertSum

Overview

  • The paper introduces BERTSUM, a variant of BERT adapted for extractive summarization, and analyzes its effectiveness.

  • BERTSUM adds sentence-level [CLS] tokens and interval segment embeddings to handle extractive summarization tasks.

  • Experiments on the CNN/Dailymail and New York Times datasets show BERTSUM outperforms other models, especially with Transformer layers.

  • Structural improvements to BERT and post-processing steps like Trigram Blocking are key to enhancing summarization performance.

Introduction

The paper investigates the viability of fine-tuning BERT, a pre-trained Transformer model, for the task of extractive summarization. Extractive summarization is the process of producing a concise version of a document by selecting and concatenating its most salient sentences. The paper introduces BERTSUM, a variant of BERT tailored to this task, and evaluates it against other state-of-the-art systems using ROUGE metrics.

Methodology

BERT, by design, is adept at modeling rich contextual information thanks to its pre-training on sizable corpora. However, its architecture requires modification for extractive summarization: BERT produces token-level representations, and its segment embeddings distinguish only two input spans, whereas extractive summarization needs a representation for each of many sentences. The authors resolve this by inserting a [CLS] token before every sentence and using interval segment embeddings that alternate between sentences, so that each [CLS] output can serve as that sentence's representation. The paper then outlines summarization layers built on top of BERT's outputs, designed to capture document-level features, focusing on three variants: a simple classifier, an inter-sentence Transformer, and an LSTM-based approach.
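The following is a minimal sketch of this input construction and of the simplest summarization layer (a sigmoid classifier over the [CLS] outputs), assuming the Hugging Face transformers API; the variable names and the single-document, non-batched setup are illustrative assumptions rather than the authors' released code.

```python
# Sketch of BERTSUM-style input construction, assuming the Hugging Face
# `transformers` API; names and example sentences are illustrative only.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentences = [
    "The cat sat on the mat.",
    "It was a sunny day.",
    "The dog barked loudly.",
]

# Insert [CLS] before and [SEP] after every sentence, and assign alternating
# interval segment ids (0, 1, 0, ...) so sentences can be told apart.
input_ids, segment_ids, cls_positions = [], [], []
for i, sent in enumerate(sentences):
    tokens = ([tokenizer.cls_token_id]
              + tokenizer.encode(sent, add_special_tokens=False)
              + [tokenizer.sep_token_id])
    cls_positions.append(len(input_ids))       # where this sentence's [CLS] sits
    input_ids.extend(tokens)
    segment_ids.extend([i % 2] * len(tokens))  # interval segment embeddings

input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])

with torch.no_grad():
    hidden = bert(input_ids=input_ids, token_type_ids=segment_ids).last_hidden_state

# One vector per sentence: the BERT output at each [CLS] position.
sent_vectors = hidden[0, cls_positions]        # shape: (num_sentences, hidden_size)

# Simplest summarization layer: a sigmoid classifier scoring each sentence.
scorer = torch.nn.Linear(bert.config.hidden_size, 1)
scores = torch.sigmoid(scorer(sent_vectors)).squeeze(-1)
print(scores)
```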

Experiments and Results

Experiments were conducted on two well-known datasets: CNN/Dailymail and the New York Times Annotated Corpus. Notably, BERTSUM with Transformer summarization layers significantly outperformed previously established models, raising the best reported ROUGE-L score by 1.65 points, and two Transformer layers atop BERT proved particularly effective. Simplicity also paid off: LSTM-based summarization layers did not contribute substantially to overall performance.
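As a hedged illustration of the best-performing configuration, the sketch below stacks two Transformer encoder layers over the per-sentence [CLS] vectors before scoring each sentence; the head count, feed-forward size, and dropout are illustrative assumptions, not values taken from the paper.

```python
# Sketch of an inter-sentence Transformer summarization layer (two layers
# over the sentence vectors); hyperparameters here are placeholders.
import torch
import torch.nn as nn

hidden_size, num_layers = 768, 2   # two Transformer layers worked best

encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8,
                                           dim_feedforward=2048, dropout=0.1)
inter_sentence_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
scorer = nn.Linear(hidden_size, 1)

# `sent_vectors`: (num_sentences, hidden_size) [CLS] outputs from BERT,
# as in the earlier sketch; random values stand in here.
sent_vectors = torch.randn(5, hidden_size)

# nn.TransformerEncoder expects (sequence_length, batch, d_model) by default.
contextual = inter_sentence_encoder(sent_vectors.unsqueeze(1)).squeeze(1)
scores = torch.sigmoid(scorer(contextual)).squeeze(-1)   # one score per sentence
print(scores)
```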

Additions such as interval segment embeddings provided performance boosts, while Trigram Blocking further refined the output by reducing redundancy in the generated summaries. This indicates that both the structural refinements to BERT and the post-processing steps are instrumental in advancing extractive summarization.
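A minimal sketch of Trigram Blocking follows, under the assumption of a greedy, score-ordered selection with simple whitespace tokenization (both assumptions for illustration): a candidate sentence is skipped if any of its trigrams already appears in the summary selected so far.

```python
# Sketch of Trigram Blocking during sentence selection: greedily pick
# high-scoring sentences, skipping any that repeat an already-used trigram.
def trigrams(text):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_summary(sentences, scores, max_sentences=3):
    selected, seen_trigrams = [], set()
    # Consider sentences from highest to lowest predicted score.
    for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        cand_trigrams = trigrams(sentences[idx])
        if cand_trigrams & seen_trigrams:
            continue                      # overlapping trigram: block this sentence
        selected.append(idx)
        seen_trigrams |= cand_trigrams
        if len(selected) == max_sentences:
            break
    return [sentences[i] for i in sorted(selected)]   # keep document order
```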

Conclusion

The research demonstrates that BERT, with appropriate modifications and fine-tuning, sets a new benchmark in extractive summarization. The findings underscore the impact of leveraging a powerful pre-trained model like BERT, and enhancing it with targeted alterations tailored to the specific needs of the summarization endeavor. The success of BERTSUM with Transformer layers, in particular, affirms the value of this strategy, setting the stage for future research to explore and expand upon these modifications. The implementation details and code made available by the authors ensure that the research community can replicate and build upon these promising results.
