SciBERT: A Pretrained Language Model for Scientific Text

Published 26 Mar 2019 in cs.CL | (1903.10676v3)

Abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.


Authors (3)

Citations (2,689)


Summary

The paper introduces SciBERT, a specialized language model pretrained on a large corpus of scientific texts using a custom SciVocab.
It demonstrates significant improvements, including a mean F1 score increase of +2.11 with finetuning and state-of-the-art results in biomedical and computer science tasks.
The study underscores the importance of domain-specific pretraining and vocabulary adaptation for enhancing NLP performance in scientific literature.

SciBERT: A Pretrained Language Model for Scientific Text

Beltagy et al. introduce SciBERT, a language model pretrained specifically on scientific text. It is built to address the gap left by general-purpose models such as BERT and ELMo, which are trained predominantly on non-scientific corpora. The goal is to improve performance on NLP tasks in scientific domains, where annotated data is hard to obtain and expensive to produce.

Architecture and Training Corpus

SciBERT uses the same architecture as BERT: a multilayer bidirectional Transformer. It differs in its pretraining corpus and its vocabulary. SciBERT was trained on a large-scale, multi-domain corpus of 1.14 million scientific papers sourced from Semantic Scholar, covering both biomedical (82%) and computer science (18%) fields. The corpus contains approximately 3.17 billion tokens, comparable in size to the dataset BERT was originally trained on.
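As a quick sanity check on these figures, the short sketch below derives approximate per-domain paper counts and the mean paper length implied by the totals. The constants are the rounded values quoted above, so the derived numbers are approximations; the function names are illustrative only.

```python
# Rounded figures quoted in the summary above.
TOTAL_PAPERS = 1_140_000
TOTAL_TOKENS = 3_170_000_000
DOMAIN_SHARE = {"biomedical": 0.82, "computer science": 0.18}


def papers_per_domain(total_papers, shares):
    """Approximate number of papers in each domain, given fractional shares."""
    return {domain: round(total_papers * share) for domain, share in shares.items()}


def mean_tokens_per_paper(total_tokens, total_papers):
    """Average paper length in tokens implied by the corpus totals."""
    return total_tokens / total_papers


counts = papers_per_domain(TOTAL_PAPERS, DOMAIN_SHARE)
avg_len = mean_tokens_per_paper(TOTAL_TOKENS, TOTAL_PAPERS)

for domain, n in counts.items():
    print(f"{domain}: about {n:,} papers")
print(f"mean length: about {avg_len:,.0f} tokens per paper")
```

The implied average of a few thousand tokens per paper is consistent with a corpus built from full papers rather than abstracts alone.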

Tokenization follows the WordPiece method, but instead of BERT's BaseVocab, SciBERT introduces SciVocab, a vocabulary built from the scientific corpus itself. The two vocabularies overlap by only 42%, which indicates how much the frequently used words in scientific text differ from those in general-domain text. A vocabulary fitted to the domain tends to represent common scientific terms with fewer, more meaningful subword units, which can help downstream scientific NLP tasks.
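To make the effect of a domain vocabulary concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting, applied to one word with two tiny vocabularies. Both vocabularies are invented for illustration and are not excerpts of BaseVocab or SciVocab. The overlap function assumes that overlap means the share of one vocabulary's entries that also appear in the other; the summary does not state how the 42% figure was computed.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword units by greedy longest-match-first lookup.

    Pieces that continue a word carry the "##" prefix, as in BERT vocabularies.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def vocab_overlap(vocab_a, vocab_b):
    """Share of vocab_a's entries that also appear in vocab_b (assumed definition)."""
    return len(vocab_a & vocab_b) / len(vocab_a)


# Toy vocabularies, invented for this example only.
general_vocab = {"the", "cell", "chemistry", "im", "##mun", "##o", "##his", "##to", "##chemistry"}
science_vocab = {"the", "cell", "chemistry", "immuno", "##histochemistry"}

word = "immunohistochemistry"
general_pieces = wordpiece_tokenize(word, general_vocab)
science_pieces = wordpiece_tokenize(word, science_vocab)
overlap = vocab_overlap(general_vocab, science_vocab)

print(general_pieces)
print(science_pieces)
print(f"overlap: {overlap:.0%}")
```

In this toy setting the general vocabulary fragments the term into six pieces, while the domain vocabulary covers it in two. That is the kind of difference a corpus-specific vocabulary is meant to produce.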

Evaluation and Results

SciBERT's effectiveness was assessed across several NLP tasks, namely Named Entity Recognition (NER), PICO Extraction (PICO), Text Classification (CLS), Relation Classification (REL), and Dependency Parsing (DEP). These tasks spanned a variety of datasets from different scientific disciplines. The evaluation involved both finetuning and using frozen embeddings.

Key Findings:

Performance Improvement: SciBERT achieves superior performance compared to BERT-Base, recording a mean performance increase (measured in F1 score) of +2.11 with finetuning and +2.43 without finetuning. This improvement underscores the effectiveness of using a domain-specific pretrained model.
Biomedical Domain: On biomedical tasks, SciBERT not only outperformed BERT-Base but also established new state-of-the-art (SOTA) results on datasets such as BC5CDR (NER) and ChemProt (REL). These improvements support the value of both the pretraining corpus and SciVocab for domain-specific tasks.
Computer Science Domain: For computer science tasks, SciBERT again demonstrated impressive performance gains over BERT-Base, achieving SOTA results on datasets such as ACL-ARC (CLS) and the NER portion of SciERC.
Multi-domain Tasks: On multi-domain datasets, SciBERT consistently outperformed BERT-Base, although by a smaller margin than on the single-domain tasks.
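A mean gain such as the +2.11 F1 reported above can be read as an average of per-dataset differences between two models. The sketch below shows that calculation, assuming an unweighted mean over the datasets both models were evaluated on. The dataset names and scores are hypothetical placeholders, not values from the paper.

```python
from statistics import mean


def mean_f1_gain(baseline, candidate):
    """Unweighted mean of (candidate - baseline) F1 over shared datasets."""
    shared = baseline.keys() & candidate.keys()
    if not shared:
        raise ValueError("the two result sets share no datasets")
    return mean(candidate[name] - baseline[name] for name in shared)


# Hypothetical F1 scores for illustration only; NOT results from the paper.
baseline_f1 = {"dataset_a": 80.0, "dataset_b": 70.0, "dataset_c": 90.0}
candidate_f1 = {"dataset_a": 83.0, "dataset_b": 72.0, "dataset_c": 91.0}

gain = mean_f1_gain(baseline_f1, candidate_f1)
print(f"mean F1 gain: {gain:+.2f}")
```

Because the mean is unweighted, a small dataset counts as much as a large one, so an aggregate like this is best read alongside the per-dataset scores.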

Discussion

One significant observation from this work is the importance of finetuning. Finetuning SciBERT on a specific task generally yielded better results than training task-specific architectures on top of frozen embeddings.

The development of SciVocab, a scientific domain-specific vocabulary, contributed positively to the performance, although the majority of the gains could still be attributed to pretraining on the scientific corpus.

Practical and Theoretical Implications

SciBERT has practical value for the scientific community, where NLP tools are increasingly used to manage the growing volume of publications. By improving performance on key NLP tasks, it supports more accurate information extraction, classification, and parsing of scientific texts, which helps researchers with knowledge discovery and synthesis.

Theoretically, this work reinforces the importance of domain-specific pretraining. It also highlights the value of vocabulary adaptation to domain-specific contexts, a factor often understated in general NLP model development.

Future Directions

The authors suggest avenues for future work, such as releasing a larger version of SciBERT analogous to BERT-Large. Because such models are costly and resource-intensive to train, another focus could be tuning the proportion of papers from each domain in the training corpus so that a single model serves a wider range of scientific fields.

In summary, SciBERT is a strong reference point for domain-specific language models, showing that pretraining on in-domain text yields measurable gains on scientific NLP tasks. The contribution improves NLP capabilities in the scientific domain and lays groundwork for further work on specialized language model pretraining.
