Targeted Syntactic Evaluation of Language Models

Published 27 Aug 2018 in cs.CL | (1808.09031v1)

Abstract: We present a dataset for evaluating the grammaticality of the predictions of a language model. We automatically construct a large number of minimally different pairs of English sentences, each consisting of a grammatical and an ungrammatical sentence. The sentence pairs represent different variations of structure-sensitive phenomena: subject-verb agreement, reflexive anaphora and negative polarity items. We expect a language model to assign a higher probability to the grammatical sentence than the ungrammatical one. In an experiment using this data set, an LSTM language model performed poorly on many of the constructions. Multi-task training with a syntactic objective (CCG supertagging) improved the LSTM's accuracy, but a large gap remained between its performance and the accuracy of human participants recruited online. This suggests that there is considerable room for improvement over LSTMs in capturing syntax in a language model.

Summary

The paper introduces a dataset of minimally different sentence pairs specifically designed to evaluate language models' grasp of English syntax.
It shows that RNN models capture local syntactic dependencies well but struggle with non-local ones, such as agreement across object relative clauses.
Multi-task training with CCG supertagging improves model performance, suggesting a promising approach for enhancing syntactic understanding.

Targeted Syntactic Evaluation of Language Models: An Analysis

The paper "Targeted Syntactic Evaluation of Language Models" by Rebecca Marvin and Tal Linzen introduces a dataset for the targeted syntactic evaluation of language models (LMs). The dataset comprises minimally different pairs of English sentences, one grammatical and the other ungrammatical, across several syntactic phenomena: subject-verb agreement, reflexive anaphora, and negative polarity items. The authors use these pairs to test whether an LM captures a grammatical rule: a model that has learned the rule should assign a higher probability to the grammatical sentence than to the ungrammatical one.
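The pairwise criterion can be sketched in a few lines, assuming only that a model exposes a sentence-level log-probability. The add-one-smoothed bigram scorer, the three-sentence corpus, and the names `train_bigram`, `log_prob`, and `pair_accuracy` below are illustrative assumptions, not the models or code used in the paper.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    bigrams, contexts, vocab = Counter(), Counter(), {"</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    # +1 reserves probability mass for words never seen in training.
    return bigrams, contexts, len(vocab) + 1

def log_prob(sentence, model):
    """Add-one-smoothed bigram log-probability of a sentence."""
    bigrams, contexts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )

def pair_accuracy(pairs, score):
    """Fraction of pairs whose grammatical member gets the strictly higher score."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)

# Toy training corpus (hypothetical); a stand-in for a real LM's training data.
corpus = ["the author laughs", "the authors laugh", "the guard likes the authors"]
model = train_bigram(corpus)

# Each pair is (grammatical, ungrammatical) and differs in a single word.
pairs = [
    ("the author laughs", "the author laugh"),
    ("the authors that the guard likes laugh",
     "the authors that the guard likes laughs"),
]
accuracy = pair_accuracy(pairs, lambda s: log_prob(s, model))
print(accuracy)  # 0.5
```

In this toy setup the first pair is scored correctly because the bigram "author laughs" was seen in training. The second pair ends in a tie, since a bigram window cannot see the subject across the intervening clause. Counting ties as errors is a choice made for this sketch.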

Methodology and Experimental Setup

The authors construct sentence pairs that target specific syntactic challenges, which aggregate evaluation metrics like perplexity tend to obscure. Using templates combined with non-recursive context-free grammars, they automatically generate a corpus of more than 350,000 sentence pairs. This approach gives control over the syntactic constructions tested and minimizes semantic or collocational cues that might inadvertently help a model distinguish grammatical from ungrammatical sentences.
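As a rough illustration of template-based generation, the sketch below fills two templates from a tiny lexicon and flips only the number of the final verb to produce the ungrammatical member of each pair. The lexicon, the template wording, and the function names are hypothetical; the paper's actual grammar and vocabulary are much larger.

```python
from itertools import product

# Hypothetical toy lexicon, indexed by grammatical number.
NOUNS = {"sg": ["author", "pilot"], "pl": ["authors", "pilots"]}
VERBS = {"sg": ["laughs", "swims"], "pl": ["laugh", "swim"]}
EMBEDDED = {"sg": ("guard", "likes"), "pl": ("guards", "like")}

def other(number):
    """Return the opposite grammatical number."""
    return "pl" if number == "sg" else "sg"

def simple_agreement():
    """Yield (grammatical, ungrammatical) pairs with an adjacent subject and verb."""
    for number in ("sg", "pl"):
        for noun, i in product(NOUNS[number], range(len(VERBS["sg"]))):
            yield (f"the {noun} {VERBS[number][i]}",
                   f"the {noun} {VERBS[other(number)][i]}")

def across_object_relative():
    """Same contrast, with an object relative clause between subject and verb."""
    for number, emb_number in product(("sg", "pl"), repeat=2):
        emb_noun, emb_verb = EMBEDDED[emb_number]
        for noun, i in product(NOUNS[number], range(len(VERBS["sg"]))):
            prefix = f"the {noun} that the {emb_noun} {emb_verb}"
            yield (f"{prefix} {VERBS[number][i]}",
                   f"{prefix} {VERBS[other(number)][i]}")

simple_pairs = list(simple_agreement())
relative_pairs = list(across_object_relative())
print(len(simple_pairs), len(relative_pairs))  # 8 16
```

Because both members of a pair share every word except the final verb, any difference in model score can be attributed to the agreement contrast and not to other lexical content.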

Three language models were evaluated: an n-gram baseline, an RNN LM trained on an unannotated corpus, and an RNN LM trained on a multi-task objective combining language modeling and CCG supertagging. Human judgments collected via Amazon Mechanical Turk provided a point of comparison for LM performance.

Key Findings

The study reports that RNN models outperform n-gram models on local syntactic dependencies, such as simple subject-verb agreement. RNN performance declines markedly on non-local dependencies, such as agreement across object relative clauses, where accuracy often approached chance (50% for a two-way choice). Multi-task training with CCG supertagging improved performance, although human accuracy remained higher across most of the constructions evaluated.

The results show that LMs, and RNNs in particular, still struggle with some syntactic structures. The authors also note that RNN performance is sensitive to lexical variation: certain verbs or pronouns disproportionately influence model accuracy, which they suggest is related to the frequency or contextual predictability of those words in the training corpus.
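One way to surface this kind of lexical sensitivity is to break accuracy down by construction and by lexical item. The sketch below assumes per-pair outcomes are already available as dictionaries; the field names and the `accuracy_by` helper are hypothetical, and the records are placeholders, not results from the paper.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Group per-pair outcomes by `key` and return {group: fraction correct}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}

# Placeholder outcomes for illustration only; these are not results from the paper.
records = [
    {"construction": "simple", "verb": "laugh", "correct": True},
    {"construction": "simple", "verb": "swim", "correct": True},
    {"construction": "object_relative", "verb": "laugh", "correct": True},
    {"construction": "object_relative", "verb": "swim", "correct": False},
]
by_construction = accuracy_by(records, "construction")
by_verb = accuracy_by(records, "verb")
print(by_construction)
print(by_verb)
```

Comparing the two breakdowns shows whether an accuracy drop is spread evenly over a construction or concentrated in a few lexical items.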

Practical Implications and Future Directions

This study offers valuable insights into LM syntactic capabilities and the potential of multi-task learning as a vehicle for improved syntactic understanding. The constructed dataset sets a foundation for further advancements in LM architectures and evaluation methods, moving beyond the limitations of perplexity.

Future research could expand this evaluation framework to other linguistic phenomena or explore additional architectural innovations to bridge the performance gap with human syntactic comprehension. The findings serve as a call to the computational linguistics community to push for models that not only approximate human performance in lexical tasks but also excel in syntactic understanding.

By isolating specific syntactic constructions and refining how they are assessed, the work supports a more fine-grained study of how computational models handle natural-language syntax.