Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings (1909.10430v2)

Published 23 Sep 2019 in cs.CL

Abstract: Contextualized word embeddings (CWE) such as provided by ELMo (Peters et al., 2018), Flair NLP (Akbik et al., 2018), or BERT (Devlin et al., 2019) are a major recent innovation in NLP. CWEs provide semantic vector representations of words depending on their respective context. Their advantage over static word embeddings has been shown for a number of tasks, such as text classification, sequence tagging, or machine translation. Since vectors of the same word type can vary depending on the respective context, they implicitly provide a model for word sense disambiguation (WSD). We introduce a simple but effective approach to WSD using a nearest neighbor classification on CWEs. We compare the performance of different CWE models for the task and can report improvements above the current state of the art for two standard WSD benchmark datasets. We further show that the pre-trained BERT model is able to place polysemic words into distinct 'sense' regions of the embedding space, while ELMo and Flair NLP do not seem to possess this ability.

Citations (171)

Summary

  • The paper presents a kNN-based method that leverages BERT’s contextualized embeddings for interpretable word sense disambiguation.
  • It shows state-of-the-art performance on SensEval-2 and SensEval-3 through effective clustering of polysemous words.
  • The research opens avenues for unsupervised sense induction and improved clustering techniques in advanced NLP applications.

Interpretable Word Sense Disambiguation with Contextualized Embeddings

The paper "Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings" presents a method for word sense disambiguation (WSD) utilizing contextualized word embeddings (CWEs). CWEs from models such as BERT, ELMo, and Flair offer advanced capabilities in providing semantic vector representations of words in context, surpassing static embeddings in capturing polysemy.

Overview and Approach

The authors propose using CWEs directly for WSD via a k-nearest neighbor (kNN) classification approach. Because CWEs produce distinct vectors for the same token in different contexts, they implicitly model word senses without requiring a predefined sense inventory. Relying on kNN also makes the method interpretable: each classification decision can be traced back to the training sentences whose embeddings are nearest to the query.
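
The sketch below illustrates the general idea, not the paper's exact setup: it assumes per-token vectors from the last layer of bert-base-uncased via Hugging Face Transformers, and a tiny hand-labeled set of sentences for the lemma "bank" stands in for a real sense-annotated corpus.

```python
# Minimal sketch of kNN word sense disambiguation over contextualized
# embeddings. Model name, layer choice, and the toy sense-annotated data
# are illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neighbors import KNeighborsClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_target(sentence: str, target: str) -> torch.Tensor:
    """Return the contextualized vector of the first occurrence of `target`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    target_id = tokenizer.encode(target, add_special_tokens=False)[0]
    position = (enc["input_ids"][0] == target_id).nonzero()[0].item()
    return hidden[position]

# Tiny sense-annotated "training" set for the lemma "bank".
train = [
    ("He deposited cash at the bank.",       "bank%finance"),
    ("The bank approved her mortgage.",      "bank%finance"),
    ("They had a picnic on the river bank.", "bank%river"),
    ("Erosion wore away the muddy bank.",    "bank%river"),
]
X = torch.stack([embed_target(s, "bank") for s, _ in train]).numpy()
y = [sense for _, sense in train]

# Nearest-neighbor classification: the predicted sense is the sense of the
# most similar training occurrence, which keeps decisions traceable.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X, y)

test_vec = embed_target("She withdrew money from the bank.", "bank").numpy()
print(knn.predict([test_vec]))   # expected: ['bank%finance']
```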

Results and Comparisons

The paper reports significant improvements in WSD performance using BERT embeddings over other CWE models and establishes new state-of-the-art results on two standard WSD datasets, SensEval-2 and SensEval-3. The evaluation also shows that BERT embeddings cluster polysemic words into distinct sense regions, whereas ELMo and Flair embeddings do not exhibit the same degree of separability.
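
One simple way to probe the claimed "sense regions" is to compare cosine similarities of a target word's vectors within and across senses. The snippet below reuses the embed_target helper from the previous sketch; the sentences and the expectation that within-sense similarity exceeds across-sense similarity are illustrative assumptions, not figures from the paper.

```python
# Illustrative check of the "sense region" claim: contextualized vectors of
# a polysemous word should be closer within a sense than across senses.
from torch.nn.functional import cosine_similarity

finance_1 = embed_target("He deposited cash at the bank.", "bank")
finance_2 = embed_target("The bank approved her mortgage.", "bank")
river_1   = embed_target("They had a picnic on the river bank.", "bank")

within = cosine_similarity(finance_1, finance_2, dim=0).item()
across = cosine_similarity(finance_1, river_1, dim=0).item()
print(f"within-sense similarity: {within:.3f}")
print(f"across-sense similarity: {across:.3f}")
# For BERT one would expect within > across; the paper reports that ELMo and
# Flair embeddings show much weaker separation of this kind.
```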

Implications

The findings underscore the potential of transformer-based models like BERT in encoding meaningful semantic distinctions beyond traditional embeddings. In practice, this may lead to more robust and interpretable NLP systems capable of sophisticated language understanding tasks like WSD across diverse contexts.

Future Directions

The authors suggest exploring unsupervised sense induction through cluster analyses within CWE spaces. They also hint at extending evaluations to newer models like RoBERTa, XLNet, and others that build upon BERT's architecture. Understanding near-miss errors and investigating more complex classifiers might further mitigate issues stemming from sparse training datasets.
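
As a hedged sketch of the suggested direction, one could cluster the contextualized vectors of a single lemma and treat each cluster as an induced sense. The choice of KMeans with two clusters and the example sentences are assumptions for illustration only; the snippet again reuses embed_target from the first sketch.

```python
# Hypothetical sketch of unsupervised sense induction by clustering the
# contextualized vectors of one lemma; the paper only proposes the direction.
import numpy as np
from sklearn.cluster import KMeans

sentences = [
    "He deposited cash at the bank.",
    "The bank approved her mortgage.",
    "They had a picnic on the river bank.",
    "Erosion wore away the muddy bank.",
]
vectors = np.stack([embed_target(s, "bank").numpy() for s in sentences])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for sentence, cluster in zip(sentences, clusters):
    print(cluster, sentence)   # occurrences of each induced "sense" share a label
```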

In conclusion, this research demonstrates BERT's proficiency in disentangling word senses through contextual embeddings, advancing the methodology for accurate and interpretable word sense disambiguation in NLP.