What you can cram into a single vector: Probing sentence embeddings for linguistic properties (1805.01070v2)

Published 3 May 2018 in cs.CL

Abstract: Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. "Downstream" tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficult to infer what kind of information is present in the representations. We introduce here 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.

Authors

Alexis Conneau
Guillaume Lample
Loïc Barrault
Marco Baroni
German Kruszewski

Summary

Using ten probing tasks, the paper shows that different encoder architectures and training objectives capture syntactic and semantic information to varying degrees.
It trains three encoder architectures (BiLSTM-last, BiLSTM-max, and Gated ConvNet) with diverse objectives, including NMT, NLI, autoencoding, Seq2Tree, and SkipThought, to systematically assess the linguistic capabilities of sentence embeddings.
Findings indicate that simple lexical features correlate with downstream performance, while the more nuanced syntactic and semantic tasks show which properties an embedding method still fails to capture.

Probing Sentence Embeddings for Linguistic Properties

The paper "What you can cram into a single vector: Probing sentence embeddings for linguistic properties" by Conneau et al. is an empirical study of what sentence embeddings encode. It applies a set of probing tasks to these embeddings to measure how much of a sentence's syntactic and semantic information they capture.

Overview of Probing Tasks

The authors introduce a suite of ten probing tasks designed to analyze the linguistic capabilities of sentence embeddings. These tasks are:

Sentence Length (SentLen): Tests whether the length of a sentence can be predicted from its embedding.
Word Content (WC): Tests whether the presence of specific words can be recovered from the embedding.
Bigram Shift (BShift): Tests whether embeddings are sensitive to word order by detecting sentences in which two adjacent words have been swapped.
Tree Depth (TreeDepth): Tests whether embeddings capture hierarchical structure by predicting the depth of the sentence's parse tree.
Top Constituents (TopConst): Classifies sentences by the sequence of top-level constituents in their parse trees.
Tense: Identifies the tense of the main-clause verb.
Subject Number (SubjNum): Identifies the number (singular/plural) of the main-clause subject.
Object Number (ObjNum): Identifies the number (singular/plural) of the main-clause object.
Semantic Odd Man Out (SOMO): Detects sentences in which a word has been replaced, producing a semantic anomaly.
Coordination Inversion (CoordInv): Detects sentences in which the order of two coordinated clauses has been inverted.

These tasks are built from sentences in the Toronto Book Corpus. The task data sets are balanced and controlled for nuisance factors such as sentence length and lexical content, so that a classifier cannot solve a task from superficial cues instead of the intended linguistic property.
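To make the task construction more concrete, here is a minimal sketch of how examples for two of the tasks might be generated from raw sentences. It is not the authors' pipeline: the function names, the "original"/"inverted" labels, the alternating labeling scheme, and the length-bin edges are all assumptions chosen for illustration, and the controls for nuisance factors described above are omitted.

```python
import random


def bigram_shift(tokens, rng):
    """Return a copy of tokens with one random pair of adjacent tokens swapped."""
    # Only swap positions where the two tokens differ, so the result
    # is guaranteed to be different from the input.
    candidates = [i for i in range(len(tokens) - 1) if tokens[i] != tokens[i + 1]]
    if not candidates:
        raise ValueError("no adjacent pair of distinct tokens to swap")
    i = rng.choice(candidates)
    shifted = list(tokens)
    shifted[i], shifted[i + 1] = shifted[i + 1], shifted[i]
    return shifted


def make_bshift_examples(sentences, seed=0):
    """Alternate between intact and shifted sentences to keep classes balanced."""
    rng = random.Random(seed)
    examples = []
    for idx, sentence in enumerate(sentences):
        tokens = sentence.split()  # whitespace tokenization (a simplification)
        if idx % 2 == 0:
            examples.append((tokens, "original"))
        else:
            examples.append((bigram_shift(tokens, rng), "inverted"))
    return examples


def length_bin(tokens, edges=(8, 12, 16, 20, 25)):
    """Map a sentence to a length-bin index; these bin edges are hypothetical."""
    n = len(tokens)
    for bin_index, upper in enumerate(edges):
        if n <= upper:
            return bin_index
    return len(edges)


examples = make_bshift_examples(["the cat sat down", "a dog barked loudly today"])
```

Treating sentence length as a classification over bins rather than a regression keeps all probing tasks in the same format, which is convenient when a single classifier design is reused across tasks.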

Sentence Embedding Models and Training Methods

The paper evaluates three encoder architectures: BiLSTM-last, BiLSTM-max, and Gated ConvNet. These architectures are trained on various tasks including Neural Machine Translation (NMT) for multiple language pairs (En-Fr, En-De, En-Fi), autoencoding, Seq2Tree, SkipThought, and Natural Language Inference (NLI), to generate sentence embeddings. These methodologies represent a broad spectrum of both supervised and unsupervised learning techniques.
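The two BiLSTM variants differ only in how the per-token hidden states are collapsed into one fixed-size vector. The sketch below illustrates that difference on hand-written toy hidden states; it assumes the states have already been computed by some recurrent encoder, and the function names and numbers are hypothetical.

```python
def last_pool(forward_states, backward_states):
    """BiLSTM-last style: concatenate the final state of each direction.

    forward_states[t] is the forward hidden state after reading token t.
    backward_states[t] is the backward hidden state at token t, so the
    backward direction finishes at position 0.
    """
    return list(forward_states[-1]) + list(backward_states[0])


def max_pool(forward_states, backward_states):
    """BiLSTM-max style: element-wise maximum over time of the concatenated states."""
    per_token = [list(f) + list(b) for f, b in zip(forward_states, backward_states)]
    return [max(column) for column in zip(*per_token)]


# Toy hidden states for a three-token sentence, two units per direction.
fwd = [[0.1, -0.3], [0.7, 0.2], [0.4, 0.5]]
bwd = [[0.6, 0.0], [-0.2, 0.9], [0.3, 0.1]]

print(last_pool(fwd, bwd))  # [0.4, 0.5, 0.6, 0.0]
print(max_pool(fwd, bwd))   # [0.7, 0.5, 0.6, 0.9]
```

With max pooling, every token can contribute to the final vector directly, whereas the last-state variant has to carry all information through the recurrence.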

Results and Analysis

The results reveal substantial variance in the linguistic properties captured by different models and training regimes. The BiLSTM-max encoder, notably, demonstrates a strong initial capacity to capture linguistic information even when untrained, underscoring the effectiveness of its architectural bias. On the other hand, models trained on NMT tasks show more proficiency in capturing complex linguistic features compared to those trained on NLI, although NLI models exhibit superior performance in downstream NLP tasks.

The probing results are compared against strong baselines such as Naive Bayes with tf-idf features and Bag-of-Vectors using fastText embeddings, which puts the performance of the trained encoders in context. A noteworthy observation is that simple word content and word order features, although not challenging to capture, are still highly useful in downstream NLP applications. This may reflect a tendency of current embeddings, and of the tasks used to evaluate them, to rely on superficial sentence properties.
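A Bag-of-Vectors baseline represents a sentence by combining the vectors of its words while ignoring their order. The sketch below assumes the combination is a plain average and uses a tiny hand-written lookup table in place of real fastText vectors; both the table and the function name are hypothetical.

```python
def bag_of_vectors(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy two-dimensional "embeddings" standing in for pre-trained word vectors.
toy_vectors = {"cat": [1.0, 0.0], "sat": [0.0, 1.0]}

sentence_vector = bag_of_vectors(["the", "cat", "sat"], toy_vectors, dim=2)
print(sentence_vector)  # [0.5, 0.5]
```

Because the average is invariant to token order, a representation like this cannot distinguish a sentence from a reordering of the same words, which is why order-sensitive tasks such as BShift are informative when compared against it.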

Correlations with Downstream Tasks

The paper also evaluates the embedding methods on well-established downstream tasks (e.g., sentiment analysis, question classification, paraphrase detection). Correlation analysis between probing-task and downstream-task performance reveals intriguing patterns: simple word content tasks correlate positively with downstream performance, and more nuanced syntactic and semantic tasks such as SOMO and CoordInv also correlate positively, though less strongly.

These correlations suggest that, while deep syntactic properties are valuable, success in downstream tasks may still rely heavily on shallower lexical features.
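The kind of analysis described here can be sketched as follows: each embedding method contributes one probing accuracy and one downstream accuracy, and a correlation coefficient is computed across methods. This summary does not say which coefficient is used, so the sketch assumes Spearman rank correlation; the accuracy values are placeholders invented for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder accuracies for four hypothetical embedding methods.
probing_accuracy = [60.0, 70.0, 80.0, 90.0]
downstream_accuracy = [71.0, 73.0, 72.5, 77.0]

rho = spearman(probing_accuracy, downstream_accuracy)
```

A coefficient near 1 would mean that methods ranking high on the probing task also tend to rank high on the downstream task; it says nothing about whether the probed property causes the downstream gain.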

Implications and Future Developments

The systematic probing framework established by this paper serves two purposes: it provides a diagnostic toolset for assessing what linguistic information an embedding encodes, and it points toward more robust, linguistically informed embedding techniques. Future work could extend the probing tasks to other languages and refine multi-task training methods to improve embedding performance on both probing and downstream tasks.

Conclusion

The paper examines the linguistic properties encoded by sentence embeddings through a methodical set of probing tasks. By comparing different encoder architectures and training methods, it offers a fine-grained picture of what these models capture and a basis for designing more linguistically sophisticated embedding strategies.
