SpanBERT: Improving Pre-training by Representing and Predicting Spans

Published 24 Jul 2019 in cs.CL and cs.LG | (1907.10529v3)

Abstract: We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even show gains on GLUE.

Summary

The paper introduces span-level masking and a span-boundary objective to better capture contiguous text spans during pre-training.
Experiments demonstrate substantial improvements, including gains of 3.3 and 5.4 F1 points on SQuAD 1.1 and 2.0, respectively, and a 6.6-point gain on OntoNotes coreference resolution.
Ablation studies confirm that randomly masking contiguous spans consistently enhances performance across various NLP benchmarks.

SpanBERT is a pre-training method that improves how text spans are represented and predicted. It extends the original BERT by changing both the masking scheme and the training objective, targeting tasks that involve span-level reasoning such as question answering and coreference resolution.

Key Contributions

SpanBERT introduces two primary modifications:

Masking Contiguous Spans: Rather than masking individual random tokens, SpanBERT masks contiguous spans of text. This approach forces the model to predict entire spans based on the surrounding context rather than relying on individual token predictions.
Span-Boundary Objective (SBO): This novel objective trains the model to predict the content of a masked span using only the representations of its boundary tokens. This reinforces the model's ability to encode span-level information, which can be efficiently accessed during the fine-tuning phase.
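
The span masking step can be illustrated with a short sketch. This is not the authors' implementation: the function names `sample_span_length` and `mask_spans` are hypothetical, and the hyperparameters (a geometric length distribution with p = 0.2 clipped at 10 tokens, and a 15% masking budget) are taken as assumptions from the paper's description rather than from this summary. For simplicity the sketch works on plain tokens instead of whole words, always substitutes `[MASK]`, and folds the clipped tail of the distribution into the maximum length.

```python
import random


def sample_span_length(rng, p=0.2, max_len=10):
    """Draw a span length from a geometric distribution, clipped at max_len."""
    length = 1
    # Keep extending with probability (1 - p); shorter spans are more likely.
    while length < max_len and rng.random() > p:
        length += 1
    return length


def mask_spans(tokens, mask_ratio=0.15, seed=0, mask_token="[MASK]"):
    """Mask contiguous spans until roughly mask_ratio of the tokens are covered.

    Returns the corrupted token list and the masked spans as inclusive
    (start, end) index pairs, sorted by position.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * mask_ratio))
    masked, spans = set(), []
    attempts = 0
    while len(masked) < budget and attempts < 1000:
        attempts += 1
        # Never exceed the remaining masking budget.
        length = min(sample_span_length(rng), budget - len(masked))
        start = rng.randrange(0, len(tokens) - length + 1)
        # Reject candidates that overlap or touch an already masked span,
        # so that every recorded span keeps unmasked boundary neighbours.
        if any(i in masked for i in range(start - 1, start + length + 1)):
            continue
        masked.update(range(start, start + length))
        spans.append((start, start + length - 1))
    corrupted = [mask_token if i in masked else tok
                 for i, tok in enumerate(tokens)]
    return corrupted, sorted(spans)


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog near the river bank today".split()
    corrupted, spans = mask_spans(words, seed=3)
    print(corrupted)
    print(spans)
```

The key contrast with token-level masking is that the unit of corruption is a run of adjacent positions, so the model cannot recover a masked token by copying cues from an unmasked neighbour inside the same span.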

Experimental Results

SpanBERT's efficacy is demonstrated across various NLP benchmarks:

SQuAD 1.1 and 2.0: SpanBERT achieves 94.6% F1 on SQuAD 1.1 and 88.7% F1 on SQuAD 2.0, outperforming BERT by 3.3 and 5.4 F1 points, respectively.
OntoNotes Coreference Resolution: SpanBERT sets a new state of the art on this task with 79.6% F1, an improvement of 6.6 points over the previous best model.
TACRED Relation Extraction: The model attains 70.8% F1, a strong result on this benchmark.
GLUE Benchmark: SpanBERT improves on tasks such as QNLI and RTE, with QNLI accuracy reaching 94.3% and RTE improving by 6.9 points over the baseline, raising the overall GLUE average to 82.8.

Comparative Baselines

Three BERT variants were used as baselines for comparison:

Google BERT: The original pre-trained models reported by Devlin et al.
Our BERT: A reimplementation with improved preprocessing and optimization.
Our BERT-1seq: A single-sequence trained version without the next sentence prediction (NSP) task.

Observations from Ablation Studies

Ablation studies highlight the advantages of SpanBERT's design choices:

Masking Schemes: Random span masking outperformed linguistically informed schemes (e.g., masking named entities or noun phrases) on most tasks, which suggests that random span selection is a robust default.
Auxiliary Objectives: Removing NSP and employing single-sequence training generally yielded better results. Additionally, integrating the SBO with span masking consistently improved performance across tasks, particularly in coreference resolution.
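
To make the pairing of span masking with the SBO concrete, the sketch below builds the prediction problems the SBO poses for a set of masked spans. Each masked token is paired with the indices of the two unmasked tokens just outside its span and with its 1-based position inside the span. The function name `sbo_examples` and the tuple layout are hypothetical, chosen for illustration. In the model itself, the two boundary representations and a position embedding would be fed to a small feed-forward network, with the resulting loss added to the regular masked language modeling loss; that part is omitted here.

```python
def sbo_examples(tokens, spans):
    """Build (left_boundary, right_boundary, relative_position, target) tuples.

    tokens: the original, uncorrupted token list.
    spans:  inclusive (start, end) index pairs of masked spans.
    """
    examples = []
    for start, end in spans:
        left, right = start - 1, end + 1
        # A span needs an observed token on both sides. Assumption: real inputs
        # are wrapped in [CLS] ... [SEP], so this check rarely triggers.
        if left < 0 or right >= len(tokens):
            continue
        for i in range(start, end + 1):
            # Relative position is 1-based within the span.
            examples.append((left, right, i - start + 1, tokens[i]))
    return examples


if __name__ == "__main__":
    toks = ["[CLS]", "an", "american", "football", "game", "[SEP]"]
    for example in sbo_examples(toks, [(2, 3)]):
        print(example)
    # (1, 4, 1, 'american')
    # (1, 4, 2, 'football')
```

Because each target is predicted from the boundary indices alone, the boundary tokens are pushed to summarize the span's content, which is the kind of information span selection models read off at fine-tuning time.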

Theoretical and Practical Implications

SpanBERT's results indicate that the choice of masking scheme and pre-training objective has a material effect on downstream performance. By pre-training at the span level, SpanBERT improves accuracy on span-intensive tasks and also carries over to a diverse set of other NLP benchmarks.

Future Directions

Several potential avenues can be explored based on SpanBERT's contributions:

Broader Application: Applying span-based pre-training to other types of spans such as syntactic structures or semantic roles may uncover further performance gains.
Cross-lingual Pre-training: Extending the span-based pre-training approach to multilingual corpora could enhance cross-lingual understanding and performance.
Large-scale Training: Leveraging larger corpora and increased computational resources could further elevate the performance ceilings observed with SpanBERT.

Conclusion

SpanBERT proposes a pre-training approach that captures and exploits span-level information, yielding substantial improvements across a range of NLP tasks. The method advances the state of the art on span-related benchmarks and provides a strong foundation for future research on pre-trained language models.
