Dissecting Contextual Word Embeddings: Architecture and Representation

Published 27 Aug 2018 in cs.CL | (1808.08949v2)

Abstract: Contextual word representations derived from pre-trained bidirectional language models (biLMs) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks. However, many questions remain as to how and why these models are so effective. In this paper, we present a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. We show there is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks. Additionally, all architectures learn representations that vary with network depth, from exclusively morphological based at the word embedding layer through local syntax based in the lower contextual layers to longer range semantics such as coreference at the upper layers. Together, these results suggest that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.


Authors (4)

Citations (406)


Summary

The paper assesses how LSTM, Transformer, and CNN architectures affect the quality of contextual word embeddings.
It demonstrates that biLMs significantly outperform traditional GloVe vectors, with relative improvements of up to 25% on NLP benchmarks.
Analysis reveals that the word embedding layer captures morphology, the lower contextual layers capture local syntax, and the upper layers capture longer-range semantics such as coreference.

Dissecting Contextual Word Embeddings: Architecture and Representation

This essay examines the empirical study presented in "Dissecting Contextual Word Embeddings: Architecture and Representation" (1808.08949), which evaluates how different neural architectures affect the contextual word embeddings derived from bidirectional language models (biLMs). The study systematically analyzes the qualitative and quantitative properties of these embeddings and their performance across several NLP tasks.

Introduction

The paper investigates how different neural architectures (LSTM, CNN, and self-attention, i.e. Transformers) shape the contextual word representations produced by pre-trained biLMs. Although contextual embeddings have brought significant improvements to NLP tasks, the underlying mechanisms and the differences across architectures remain only partially understood. The authors address this gap by measuring end-task accuracy, the properties of the learned representations, and the associated speed and accuracy trade-offs.

Contextual Word Representations from biLMs

Bidirectional Language Models

biLMs are built by training a forward and a backward language model that jointly maximize the log-likelihood of the token sequence in both directions. Tokens are first mapped to context-independent embeddings and then passed through multiple layers of contextual encoders (e.g., LSTM, CNN, Transformer), which capture increasingly complex linguistic patterns.
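
The joint objective described above can be written out as follows. The notation is an assumption introduced here for illustration and does not appear in this summary: $t_1, \dots, t_N$ are the tokens of a sequence, and $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ stand for the parameters of the forward and backward models (some of which, such as the token embeddings, may be shared in practice).

```latex
\sum_{k=1}^{N} \Big(
    \log p\big(t_k \mid t_1, \dots, t_{k-1};\ \overrightarrow{\theta}\big)
  + \log p\big(t_k \mid t_{k+1}, \dots, t_N;\ \overleftarrow{\theta}\big)
\Big)
```

The first term predicts each token from its left context and the second from its right context, so every token position contributes to both directions.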

Character-Based Language Models

Character-aware models use parameters efficiently but are more computationally expensive to train. They are reported to reach marginally better perplexity than their word-based counterparts on language modeling benchmarks.

Deep Contextual Representations

The study follows ELMo in combining the biLM layers through a weighted sum whose weights are learned together with each downstream task, so the task model can draw on whichever layers carry the most useful contextual information.
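
A minimal sketch of such a layer combination is given below, assuming the ELMo-style formulation in which softmax-normalized scalar weights mix the layers and a single scale factor rescales the result. The function names `softmax` and `scalar_mix`, the parameter `gamma`, and the toy vectors are hypothetical and chosen for illustration; in a real system the scores and the scale would be learned with the downstream task rather than set by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, scores, gamma=1.0):
    """Combine per-layer vectors for ONE token into a single vector.

    layers: list of L vectors (one per biLM layer), all of the same length.
    scores: list of L unnormalized layer scores (hypothetical learned values).
    gamma:  scalar that rescales the mixed vector (hypothetical learned value).
    """
    if len(layers) != len(scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(scores)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Toy example: three layers, two dimensions, equal scores (a plain average).
token_layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mixed = scalar_mix(token_layers, scores=[0.0, 0.0, 0.0])
```

With equal scores the mix reduces to an average of the layers; raising one score shifts the combined vector toward that layer, which is how a downstream task could favor lower or upper layers.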

Architectures for Deep biLMs

LSTM

LSTMs augmented with projection layers, which help control model complexity, have proven effective across multiple tasks. The study evaluates a 2-layer model and a deeper 4-layer variant to measure the impact of depth.

Transformer

Transformers rely on attention mechanisms rather than recurrence, so tokens need not be processed one after another as in RNNs. This parallelism offers significant computational advantages during both training and inference.
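
To make the parallelism concrete, here is a deliberately simplified sketch of scaled dot-product self-attention. It assumes a single head and uses the inputs directly as queries, keys, and values (no learned projections), which is a simplification rather than the configuration used in the paper. The `causal` flag is an illustrative assumption showing how a forward language model could be prevented from looking at later tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Single-head self-attention with Q = K = V = x.

    x: list of n token vectors, each of dimension d.
    causal: if True, position i only attends to positions 0..i.
    """
    n, d = len(x), len(x[0])
    out = []
    # Each iteration depends only on x, never on earlier outputs,
    # so all positions could be computed in parallel.
    for i in range(n):
        limit = i + 1 if causal else n
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(limit)
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * x[j][k] for j, w in enumerate(weights)) for k in range(d)]
        )
    return out


tokens = [[1.0, 0.0], [0.0, 1.0]]
contextual = self_attention(tokens, causal=True)
```

Unlike a recurrent layer, no output here depends on a previous output, which is the property that lets the computation be spread across positions.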

Gated CNN

Gated CNNs model sequences with convolutions followed by gated linear units, which keeps computation efficient; stacking many such layers gives the network a large receptive field.
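
The sketch below illustrates the two ideas in this paragraph under simplifying assumptions: a gated linear unit multiplies one convolution output by the sigmoid of another, and the receptive field of stacked stride-1, undilated convolutions grows linearly with depth. The scalar (one-channel) input, the left zero-padding, the filter values, and the layer counts are all hypothetical choices for illustration and are not the paper's settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_conv1d(seq, w, v):
    """One-channel causal convolution followed by a gated linear unit.

    seq: list of floats (one feature per position).
    w:   filter for the linear path.
    v:   filter for the gate path (same width as w).
    Output at position t is (w * window) * sigmoid(v * window), where the
    window covers positions t-k+1 .. t and is zero-padded on the left.
    """
    k = len(w)
    padded = [0.0] * (k - 1) + list(seq)
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]
        linear = sum(wi * xi for wi, xi in zip(w, window))
        gate = sum(vi * xi for vi, xi in zip(v, window))
        out.append(linear * sigmoid(gate))
    return out


def receptive_field(num_layers, kernel_width):
    """Positions visible to one output after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel_width - 1)


hidden = gated_conv1d([1.0, 2.0], w=[0.0, 1.0], v=[0.0, 0.0])
span = receptive_field(num_layers=3, kernel_width=3)
```

Because the receptive field grows by only `kernel_width - 1` positions per layer, a convolutional encoder needs considerable depth before a single output can see a long context.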

Evaluation as Word Representations

Each architecture was evaluated as a source of pre-trained contextual word vectors on four benchmark NLP tasks: MultiNLI, SRL, constituency parsing, and NER. The LSTM-based biLM achieved the highest accuracy, though all architectures surpassed traditional GloVe vectors, with relative gains over GloVe of up to 25%.

Properties of Contextual Vectors

The representations captured by biLMs revealed distinct linguistic information hierarchies. Lower layers encapsulate morphological and local syntactic structures, while upper layers capture broader semantic contexts and coreferential relations.

POS and Syntax

Linear probes confirmed that word vectors from lower biLM layers were adept at syntactic tasks, whereas upper layers better captured semantic relationships, corroborating the hypothesis of hierarchical learning.
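
A linear probe is a linear classifier trained on frozen representations: if a simple linear model can predict a property from a layer's vectors, that property is readily accessible in the layer. The sketch below is a minimal softmax-regression probe in plain Python, assuming toy two-dimensional "layer vectors" and integer tag labels. The function names, hyperparameters, and data are hypothetical and do not reproduce the paper's experimental setup.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bc for row, bc in zip(W, b)]


def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit a softmax classifier with SGD; the input features stay frozen."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(W, b, x))
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to the class score.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for i in range(dim):
                    W[c][i] -= lr * grad * x[i]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy frozen vectors from one hypothetical layer, with two tag classes.
layer_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tags = [0, 0, 1, 1]
W, b = train_linear_probe(layer_vectors, tags, num_classes=2)
accuracy = sum(predict(W, b, x) == y for x, y in zip(layer_vectors, tags)) / len(tags)
```

Repeating this for each layer and comparing held-out accuracy is one way to see which layer exposes a given property most directly.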

Coreferential Similarity

Contextual similarity measures demonstrated the biLMs' ability to model coreferential relationships: without any coreference supervision, the representations achieved considerable accuracy on pronominal coreference.
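
One way to read "contextual similarity" is as a nearest-neighbor rule: given the contextual vector of a pronoun, choose the candidate antecedent whose contextual vector is most similar. The sketch below assumes cosine similarity and hand-made toy vectors. It illustrates the idea only and is not the authors' exact evaluation protocol; the names `cosine` and `pick_antecedent` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_antecedent(pronoun_vec, candidates):
    """Return the candidate mention most similar to the pronoun.

    candidates: dict mapping mention text to its contextual vector.
    """
    return max(candidates, key=lambda name: cosine(pronoun_vec, candidates[name]))


# Toy contextual vectors (hypothetical values, not model output).
pronoun = [0.9, 0.1]
mentions = {"the doctor": [1.0, 0.0], "the report": [0.0, 1.0]}
choice = pick_antecedent(pronoun, mentions)
```

No coreference labels are used anywhere in this rule, which is what makes accuracy on the task evidence about the representations themselves.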

Conclusions and Future Work

The study illustrates that biLMs, irrespective of architecture, learn comprehensive linguistic representations, transforming them into versatile feature extractors for diverse NLP applications. Future inquiries may focus on scaling models or integrating syntactic constraints, potentially enhancing biLMs' utility in more complex scenarios.

The research posits that while biLMs excel with sizeable data and model parameters, future innovations might benefit from infusing biLMs with intuitive linguistic structures or exploring semi-supervised frameworks incorporating external supervision.
