Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

Published 20 Dec 2019 in cs.CL (arXiv:1912.09637v1)

Abstract: Recent breakthroughs of pretrained language models have shown the effectiveness of self-supervised learning for a wide range of NLP tasks. In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method to capture knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task. Moreover, we propose a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model consistently outperforms BERT on four entity-related question answering datasets (i.e., WebQuestions, TriviaQA, SearchQA and Quasar-T) with an average 2.7 F1 improvement, and on a standard fine-grained entity typing dataset (i.e., FIGER) with a 5.7 accuracy gain.

Citations (196)

Summary

  • The paper introduces a novel weakly supervised pretraining method using entity replacement to boost entity-specific knowledge acquisition.
  • Evaluation on zero-shot fact completion and question answering tasks shows WKLM outperforms BERT and GPT-2 in capturing real-world knowledge.
  • The study paves the way for integrating entity-centric objectives in NLP models to improve performance across diverse knowledge-driven applications.

Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

The research article "Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model" presents a novel approach to enhancing the ability of pretrained language models to encode and retrieve real-world encyclopedic knowledge. The work is motivated by the observation that self-supervised pretrained models such as BERT perform impressively on tasks requiring real-world knowledge, in addition to standard syntactic and semantic language understanding.

Weakly Supervised Pretraining Methodology

The central contribution of this work is a weakly supervised pretraining objective called entity replacement training. Unlike approaches that depend heavily on structured knowledge bases, this method works directly on unstructured text such as Wikipedia. Entity mentions in the text are recognized and replaced with other entities of the same type, generating negative samples that force the model to distinguish factually correct statements from corrupted ones. This entity-centric objective promotes the learning of entity-level knowledge during the pretraining phase.
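To make the corruption step concrete, the following is a minimal sketch of how same-type entity replacement could be implemented for pretraining data generation. The `Mention` structure, type labels, entity pool, and replacement probability are illustrative assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class Mention:
    start: int          # character offset of the mention in `text`
    end: int            # exclusive end offset
    entity_type: str    # e.g. "PERSON", "LOCATION" (hypothetical type labels)

def corrupt_entities(text, mentions, type_pool, replace_prob=0.5, rng=random):
    """Replace some entity mentions with a random same-type entity.

    Returns the corrupted text plus one binary label per mention:
    1 = original mention kept, 0 = mention replaced (negative sample).
    A model can then be trained to predict these labels from context.
    """
    pieces, labels = [], []
    cursor = 0
    for m in sorted(mentions, key=lambda m: m.start):
        pieces.append(text[cursor:m.start])
        surface = text[m.start:m.end]
        candidates = [e for e in type_pool.get(m.entity_type, []) if e != surface]
        if candidates and rng.random() < replace_prob:
            pieces.append(rng.choice(candidates))  # corrupted mention
            labels.append(0)
        else:
            pieces.append(surface)                 # untouched mention
            labels.append(1)
        cursor = m.end
    pieces.append(text[cursor:])
    return "".join(pieces), labels

# Example usage with toy data
text = "Barack Obama was born in Honolulu."
mentions = [Mention(0, 12, "PERSON"), Mention(25, 33, "LOCATION")]
pool = {"PERSON": ["Angela Merkel", "Barack Obama"], "LOCATION": ["Honolulu", "Chicago"]}
print(corrupt_entities(text, mentions, pool))
```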

Evaluation and Results

The efficacy of the proposed model (referred to as WKLM) is evaluated with a zero-shot fact completion task covering ten common Wikidata relations, where WKLM substantially outperforms both BERT and GPT-2. Fact completion acts as a proxy for how well the model can fill in missing entity-level knowledge, analogous to knowledge base completion.
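As an illustration of how such a zero-shot probe can be run, the sketch below scores candidate objects for a cloze-style query using an off-the-shelf masked language model via Hugging Face Transformers. The relation template, candidate list, and scoring by average token log-probability are assumptions for demonstration, not the paper's exact evaluation protocol.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Off-the-shelf masked LM used as a zero-shot fact-completion probe.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def candidate_score(subject, template, candidate):
    """Average log-probability of the candidate's tokens in a cloze statement."""
    cand_ids = tokenizer(candidate, add_special_tokens=False)["input_ids"]
    cloze = template.format(subject=subject,
                            object=" ".join([tokenizer.mask_token] * len(cand_ids)))
    inputs = tokenizer(cloze, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(mask_positions, cand_ids)) / len(cand_ids)

# Hypothetical template for a place-of-birth relation and a small candidate set.
template = "{subject} was born in {object}."
candidates = ["honolulu", "paris", "chicago"]
scores = {c: candidate_score("barack obama", template, c) for c in candidates}
print(max(scores, key=scores.get))
```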

WKLM is further tested on real-world applications requiring entity knowledge: open-domain question answering and fine-grained entity typing. Across the WebQuestions, TriviaQA, Quasar-T, and SearchQA question answering datasets, WKLM yields notable improvements over BERT, particularly when reader scores are combined with paragraph ranking scores (see the sketch below). WKLM also sets a new state of the art on the FIGER entity typing dataset, demonstrating the practical advantages of the proposed pretraining strategy in capturing and utilizing entity-centric knowledge.
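The paragraph-ranking detail can be illustrated with a small aggregation sketch: each extracted answer's reader span score is interpolated with the score of the paragraph it came from, and the best combined score per answer string is kept. The tuple format and interpolation weight here are hypothetical, not values from the paper.

```python
import math

def aggregate_answers(candidates, alpha=0.8):
    """Combine per-paragraph reader span scores with paragraph-ranker scores.

    `candidates` is a list of (answer_text, span_score, rank_score) tuples;
    `alpha` is an illustrative interpolation weight.
    """
    best = {}
    for answer, span_score, rank_score in candidates:
        combined = alpha * span_score + (1 - alpha) * rank_score
        if combined > best.get(answer, -math.inf):
            best[answer] = combined  # keep the best-scoring occurrence per answer
    return max(best, key=best.get)

# Example: the same answer string extracted from two paragraphs keeps its best score.
print(aggregate_answers([
    ("1969", 4.2, 0.9),
    ("1969", 3.1, 0.4),
    ("1972", 3.8, 0.2),
]))
```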

Implications and Future Work

This study offers an important advance in how language models can be trained to internalize and leverage entity-level knowledge without structured databases or additional memory requirements during downstream tasks. The implicit encoding of real-world knowledge suggests future directions in which pretrained models could be further enriched, leading to improved performance across a broader spectrum of NLP tasks, particularly those involving complex entity reasoning.

Future work could explore larger corpora beyond Wikipedia, employ advanced entity linking methods for improved precision, and examine the integration and balance between entity-focused objectives and standard masked language modeling. As the field moves towards building more knowledgeable AI systems, methods such as WKLM provide a promising path to bridging the gap between language comprehension and encyclopedic understanding.
