LinkBERT: Pretraining Language Models with Document Links (2203.15827v1)

Published 29 Mar 2022 in cs.CL and cs.LG

Abstract: Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data at https://github.com/michiyasunaga/LinkBERT.

Authors (3)

Michihiro Yasunaga (48 papers)
Jure Leskovec (233 papers)
Percy Liang (239 papers)

Citations (317)

Summary

The paper introduces a novel pretraining method that integrates linked documents using masked language modeling and document relation prediction.
It demonstrates substantial performance improvements on multi-hop reasoning tasks in both general and biomedical domains.
The approach paves the way for versatile graph-augmented training frameworks, enhancing retrieval-augmented systems and scientific applications.

LinkBERT: Pretraining Language Models with Document Links

In the paper "LinkBERT: Pretraining Language Models with Document Links," the authors introduce a pretraining method for language models (LMs) that uses links between documents to improve performance on downstream tasks. The goal is to extend LM pretraining beyond isolated documents by capturing knowledge that spans documents connected by links, such as hyperlinks in the general domain or citation links in biomedical literature.

Core Methodology

The authors propose LinkBERT, a method that treats a text corpus as a graph in which nodes are documents and edges are the links between them. During pretraining, linked documents are placed in the same LM input context, in contrast to conventional methods such as BERT that draw each input from a single document.
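
The graph view can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the `links` mapping, the `linked_inputs` helper, and the choice to pair each source segment with only the first segment of a linked document are all assumptions made for illustration. The `[CLS]` and `[SEP]` markers follow the usual BERT input layout.

```python
# Toy corpus (hypothetical data): each document is a list of text segments.
corpus = {
    "A": ["Segment A1.", "Segment A2."],
    "B": ["Segment B1."],
    "C": ["Segment C1.", "Segment C2."],
}

# Directed edges of the document graph (hypothetical): source -> linked documents.
links = {"A": ["B"], "B": ["C"], "C": []}


def linked_inputs(corpus, links):
    """Yield one BERT-style input string per (source segment, linked document).

    Simplification for illustration: the second half of each input is always
    the first segment of the linked document.
    """
    for source, targets in links.items():
        for target in targets:
            for segment in corpus[source]:
                yield f"[CLS] {segment} [SEP] {corpus[target][0]} [SEP]"


inputs = list(linked_inputs(corpus, links))
```

Here `inputs` holds three strings: two for the edge from A to B and one for the edge from B to C. Document C has no outgoing links in this toy graph, so it contributes no inputs as a source.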

Two primary self-supervised objectives are utilized in this framework:

Masked Language Modeling (MLM): Similar to BERT, MLM encourages the model to predict masked tokens within the input sequence, now expanded to include context from linked documents.
Document Relation Prediction (DRP): This novel objective trains the model to classify the relationship between document pairs (e.g., whether they are contiguous, linked, or random), facilitating a deeper understanding of document relevance and relations.

Together, these objectives encourage the model to learn knowledge that spans documents, which supports reasoning and comprehension across them.
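
The two objectives can be sketched in plain Python. This is an illustration under stated assumptions, not the released training code: the toy corpus, the `sample_pair` and `joint_loss` helpers, the uniform choice among the three relation labels, and the unweighted sum of the two losses are all assumptions here.

```python
import math
import random

# DRP labels for the three relation types named in the text.
CONTIGUOUS, RANDOM, LINKED = 0, 1, 2

# Toy corpus and link graph (hypothetical data).
corpus = {"A": ["A1", "A2", "A3"], "B": ["B1", "B2"], "C": ["C1", "C2"]}
links = {"A": ["B"], "B": ["C"], "C": ["A"]}


def sample_pair(corpus, links, rng):
    """Return (anchor segment, second segment, DRP label)."""
    # Anchor documents need a following segment and at least one outgoing link.
    candidates = [d for d in corpus if len(corpus[d]) > 1 and links.get(d)]
    doc = rng.choice(candidates)
    i = rng.randrange(len(corpus[doc]) - 1)
    anchor = corpus[doc][i]
    label = rng.choice([CONTIGUOUS, RANDOM, LINKED])  # assumed uniform
    if label == CONTIGUOUS:
        other = corpus[doc][i + 1]  # next segment of the same document
    elif label == LINKED:
        other = rng.choice(corpus[rng.choice(links[doc])])
    else:
        # A document that is neither the anchor's own nor linked from it.
        unrelated = [d for d in corpus if d != doc and d not in links[doc]]
        other = rng.choice(corpus[rng.choice(unrelated)])
    return anchor, other, label


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target])


def joint_loss(mlm_probs, mlm_targets, drp_probs, drp_label):
    """MLM loss summed over masked positions plus the DRP loss (assumed unweighted)."""
    mlm = sum(cross_entropy(p, t) for p, t in zip(mlm_probs, mlm_targets))
    return mlm + cross_entropy(drp_probs, drp_label)
```

In a real model, `mlm_probs` would come from the token-level output head and `drp_probs` from a three-way classifier over the sequence representation; here they are passed in as plain lists of probabilities so the arithmetic is easy to follow.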

Empirical Results

LinkBERT improves over BERT on a range of NLP tasks, with the largest gains on tasks that require multi-hop reasoning and comprehension across multiple documents. Key findings include:

General Domain: Pretrained on Wikipedia with hyperlinks, LinkBERT improves over BERT on the MRQA benchmark and GLUE tasks. The gains are largest on datasets such as HotpotQA and TriviaQA (+5% absolute, as reported in the abstract), which require reasoning over multiple sources.
Biomedical Domain: Pretrained on PubMed with citation links, the biomedical variant, BioLinkBERT, sets a new state of the art on the BLURB benchmark and MedQA-USMLE (+7% on BioASQ and USMLE, as reported in the abstract), showing its strength on domain-specific, knowledge-intensive tasks.

Implications and Future Directions

The approach has several implications for the development of language models:

Enhanced Comprehension Across Documents: By training on linked documents, models can learn knowledge that spans several documents, which matters for domains built on interconnected information, such as scientific literature and web corpora.
Versatile Pretraining Framework: The approach can be adapted to link types beyond hyperlinks and citations, potentially extending to other domains where relations between documents are available.
Applications in Retrieval-Augmented Systems: The understanding of document relations fostered by DRP can benefit retrieval-augmented systems, for example in open-domain question answering, where the model must pick out relevant context from a mix of documents.

Conclusion

LinkBERT incorporates document links into LM pretraining and improves performance across multiple domains and tasks. The method broadens the knowledge captured by LMs to include cross-document dependencies and suggests a direction for future work on graph-augmented language models, particularly for domain-specific applications that depend on densely linked corpora.

GitHub

GitHub - michiyasunaga/LinkBERT: [ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links (438 stars)

Tweets

https://twitter.com/michiyasunaga/status/1511382882173915137
