Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models (1908.06725v5)

Published 19 Aug 2019 in cs.CL

Abstract: The state-of-the-art pre-trained language representation models, such as Bidirectional Encoder Representations from Transformers (BERT), rarely incorporate commonsense knowledge or other knowledge explicitly. We propose a pre-training approach for incorporating commonsense knowledge into language representation models. We construct a commonsense-related multi-choice question answering dataset for pre-training a neural language representation model. The dataset is created automatically by our proposed "align, mask, and select" (AMS) method. We also investigate different pre-training tasks. Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieve significant improvements over previous state-of-the-art models on two commonsense-related benchmarks, including CommonsenseQA and Winograd Schema Challenge. We also observe that fine-tuned models after the proposed pre-training approach maintain comparable performance on other NLP tasks, such as sentence classification and natural language inference tasks, compared to the original BERT models. These results verify that the proposed approach, while significantly improving commonsense-related NLP tasks, does not degrade the general language representation capabilities.

Citations (67)

Summary

  • The paper presents a novel AMS pre-training strategy that aligns, masks, and selects concepts to create a multi-choice QA dataset from ConceptNet.
  • The AMS method significantly boosts commonsense reasoning, with BERT_CS_large achieving 5.5% and 3.3% improvements on CSQA and WSC respectively.
  • The approach maintains robust general language performance while suggesting that multi-choice QA pre-training outperforms traditional masked language modeling.

Align, Mask and Select: Incorporating Commonsense Knowledge into Language Representation Models

The paper "Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models" (1908.06725) introduces a pre-training methodology to inject commonsense knowledge into language representation models, specifically BERT. The approach centers around an "align, mask, and select" (AMS) method for automated construction of a multiple-choice question answering dataset derived from ConceptNet and a large text corpus. This pre-training strategy aims to enhance commonsense reasoning capabilities without compromising the general language understanding abilities of the model.

Methodology: Align, Mask, and Select (AMS)

The AMS method is designed to create a multi-choice question answering dataset from a commonsense knowledge graph (KG) and a large text corpus. The process involves three key steps:

  1. Align: Align triples of (concept1, relation, concept2) in the filtered triple set to the English Wikipedia dataset to extract sentences containing the two concepts.
  2. Mask: Replace either concept1 or concept2 in a sentence with a special token [QW], thus transforming the sentence into a question where the masked concept is the correct answer.
  3. Select: Generate distractor answer choices by identifying concepts that share the same relation with the unmasked concept in ConceptNet.

Figure 1: BERT_CS_base and BERT_CS_large accuracy on the CSQA development set against the number of pre-training steps.

The paper filters ConceptNet triples to retain only those relevant to commonsense reasoning: the concepts must be English words, the relations must not be overly general ("RelatedTo" and "IsA" are excluded), and the concepts must satisfy length and edit-distance requirements. The final dataset, denoted $\mathcal{D}_{AMS}$, comprises 16,324,846 multi-choice QA samples.
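The sketch below ties together the filtering criteria and the align/mask/select steps described above. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the word-length and similarity thresholds, the number of distractors, and the use of `difflib` are choices made for the example.

```python
from difflib import SequenceMatcher

TOO_GENERAL = {"RelatedTo", "IsA"}   # relations excluded during filtering
MASK_TOKEN = "[QW]"                  # special token that replaces the masked concept

def keep_triple(c1, rel, c2, max_words=4, similarity_cutoff=0.9):
    """Filter a ConceptNet triple; the thresholds here are assumptions."""
    if rel in TOO_GENERAL:
        return False
    if not (c1.isascii() and c2.isascii()):   # crude stand-in for "English words"
        return False
    if len(c1.split()) > max_words or len(c2.split()) > max_words:
        return False
    # stand-in for the edit-distance requirement: drop near-identical concept pairs
    if SequenceMatcher(None, c1, c2).ratio() > similarity_cutoff:
        return False
    return True

def align(triple, corpus_sentences):
    """Align: return corpus sentences that contain both concepts of the triple."""
    c1, _, c2 = triple
    return [s for s in corpus_sentences if c1 in s and c2 in s]

def mask(sentence, answer_concept):
    """Mask: replace the answer concept with [QW], turning the sentence into a question."""
    return sentence.replace(answer_concept, MASK_TOKEN, 1)

def select_distractors(triple, kg, num_distractors=4):
    """Select: concepts sharing (unmasked concept, relation) in the KG become distractors."""
    c1, rel, c2 = triple
    return [t2 for (t0, t1, t2) in kg if t0 == c1 and t1 == rel and t2 != c2][:num_distractors]

def build_samples(triple, corpus_sentences, kg):
    """Produce multi-choice QA samples for one triple (concept2 is treated as the answer)."""
    c1, rel, c2 = triple
    if not keep_triple(c1, rel, c2):
        return []
    samples = []
    for sent in align(triple, corpus_sentences):
        samples.append({
            "question": mask(sent, c2),
            "choices": [c2] + select_distractors(triple, kg),
            "answer": c2,
        })
    return samples
```

Applied over the full filtered triple set and the English Wikipedia corpus, this kind of pipeline is what yields the $\mathcal{D}_{AMS}$ samples described above.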

Pre-training and Fine-tuning

The BERT_CS models, both base and large, are initialized with pre-trained weights from Google's BERT models and then further pre-trained on the generated $\mathcal{D}_{AMS}$ dataset using a multi-choice QA task. The objective function used during pre-training is:

$$L = -\log p(c_i \mid s)$$

$$p(c_i \mid s) = \frac{\exp(\mathbf{w}^{T}\mathbf{c}_{i})}{\sum_{k=1}^{N}\exp(\mathbf{w}^{T}\mathbf{c}_{k})}$$

where $c_i$ is the correct answer, $\mathbf{w}$ represents the parameters in the softmax layer, $N$ is the number of candidates, and $\mathbf{c}_i$ is the vector representation of the [CLS] token for candidate $c_i$.
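A minimal PyTorch sketch of this multi-choice QA objective is shown below, assuming a Hugging Face `BertModel` encoder: each (question, candidate) pair is encoded separately, the [CLS] vector is scored by a shared linear layer (the $\mathbf{w}$ above), and a softmax cross-entropy over the candidates gives the loss. The model name, toy question, and candidate list are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class MultiChoiceBert(nn.Module):
    """BERT encoder plus a shared linear scorer over the [CLS] vector of each candidate."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        # corresponds to w in the objective above
        self.scorer = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_choices, seq_len), one (question, candidate) pair per row
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vectors = outputs.last_hidden_state[:, 0]   # c_k for each candidate
        return self.scorer(cls_vectors).squeeze(-1)     # unnormalized scores

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# In practice [QW] would likely be registered as a special token; here it is
# left as plain text for brevity (an assumption of this sketch).
question = "Birds are able to [QW] because they have wings."   # toy example
choices = ["fly", "swim", "read", "drive", "sing"]              # toy candidates

enc = tokenizer([question] * len(choices), choices, padding=True, return_tensors="pt")
model = MultiChoiceBert()

scores = model(enc["input_ids"], enc["attention_mask"])   # shape: (num_choices,)
# Softmax over candidates with negative log-likelihood of the correct answer (index 0).
loss = nn.functional.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
```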

Following pre-training, the BERT_CS models are fine-tuned on downstream NLP tasks, with a particular focus on commonsense reasoning benchmarks.

Experimental Results

The paper evaluates the proposed approach on two commonsense reasoning benchmarks, CommonsenseQA (CSQA) and Winograd Schema Challenge (WSC), as well as the GLUE benchmark for general language understanding. The results indicate that BERT_CS models achieve significant improvements on CSQA and WSC compared to baseline BERT models and previous state-of-the-art models. Specifically, BERT_CS_large achieved a 5.5% absolute gain over the baseline BERT_large model on the CSQA test set. On the WSC dataset, BERT_CS_large achieved a 3.3% absolute improvement over previous state-of-the-art results. Furthermore, the BERT_CS models maintain comparable performance on the GLUE benchmark, demonstrating that the proposed pre-training approach does not degrade the general language representation capabilities of the models.

Ablation Studies and Analysis

The paper includes ablation studies to analyze the impact of different data creation approaches and pre-training tasks. Key findings from the ablation studies include:

  • Pre-training on ConceptNet benefits the CSQA task, even when using triples as input instead of sentences.
  • Constructing natural language sentences as input for pre-training BERT performs better on the CSQA task than pre-training using triples (the two input formats are illustrated in the sketch after this list).
  • Using a more difficult dataset with carefully selected distractors improves performance.
  • The multi-choice QA task works better than the masked language modeling (MLM) task for the target multi-choice QA task.
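To make the first two ablation points concrete, the snippet below contrasts the two input formats for the same ConceptNet triple. The exact triple serialization used in the paper is not given in this summary, so the triple-style string is an assumption; the sentence-style input follows the AMS construction.

```python
# Illustration only: two ways to present the same knowledge to the model.
triple = ("bird", "CapableOf", "fly")

# (a) Triple as input: the raw concepts and relation, with the answer masked.
#     The serialization below is an assumed format, not the paper's exact one.
triple_input = "bird CapableOf [QW]"

# (b) Natural-language sentence as input: an aligned corpus sentence with the
#     answer concept masked by the AMS method (hypothetical example sentence).
sentence_input = "Most birds can [QW] because their bones are hollow and light."

# Both are paired with the same candidate set; the ablation finds that (b)
# pre-trains BERT more effectively for CSQA than (a).
candidates = ["fly", "swim", "read", "drive", "sing"]
```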

Error analysis on the WSC dataset reveals that BERT_CS_large is less influenced by proximity and more focused on semantics compared to BERT_large.

Implications and Future Directions

This research has several implications for the field of NLP. The AMS method provides an automated way to incorporate structured knowledge from KGs into language representation models trained on unstructured text. The approach improves the commonsense reasoning capabilities of these models, a long-standing challenge in AI. The finding that a multi-choice QA task is more effective than MLM for pre-training on commonsense knowledge suggests new avenues for exploration in pre-training strategies. Future work may focus on scaling the approach to larger KGs and larger models, as well as exploring different pre-training tasks and fine-tuning strategies. Additionally, incorporating commonsense knowledge into models such as XLNet and RoBERTa is a promising direction for future research.

Conclusion

The paper "Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models" (1908.06725) presents a practical and effective approach for incorporating commonsense knowledge into language representation models. By pre-training BERT models on a multi-choice QA dataset constructed using the AMS method, the authors demonstrate significant improvements on commonsense reasoning tasks while maintaining performance on general language understanding tasks. The ablation studies provide valuable insights into the design of pre-training strategies for commonsense reasoning, and the error analysis sheds light on the strengths and limitations of the proposed approach. This research contributes to the growing body of work on knowledge-enhanced LLMs and opens up new avenues for future research in this area.
