
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little (2104.06644v2)

Published 14 Apr 2021 in cs.CL and cs.LG

Abstract: A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks -- including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.

Authors (6)
  1. Koustuv Sinha (31 papers)
  2. Robin Jia (59 papers)
  3. Dieuwke Hupkes (49 papers)
  4. Joelle Pineau (123 papers)
  5. Adina Williams (72 papers)
  6. Douwe Kiela (85 papers)
Citations (229)

Summary

  • The paper demonstrates that masked language models excel by capturing higher-order word co-occurrence rather than relying on natural syntactic order.
  • Experiments reveal that models pre-trained on permuted word orders perform nearly as well as those trained on natural orders on tasks like QQP, SST-2, MRPC, and MNLI.
  • The study calls for revised NLP benchmarks that challenge models’ syntactic and semantic understanding beyond mere distributional statistics.

Insights from "Masked LLMing and the Distributional Hypothesis: Order Word Matters Pre-training for Little"

The paper "Masked LLMing and the Distributional Hypothesis: Order Word Matters Pre-training for Little" critically examines the inner workings of masked LLMs (MLMs) like BERT, suggesting an alternative explanation of their success beyond syntactic structure representation. The core proposition is that MLMs excel primarily due to their ability to model higher-order word co-occurrence statistics rather than a mastery of linguistic syntax and semantics traditionally needed in NLP.

Key Findings

  1. Robustness to Word Order Permutation: The foundational experiment pre-trains MLMs on corpora in which word order is randomly shuffled within each sentence; the models are then fine-tuned on various downstream tasks. Even with permuted word order during pre-training, the models perform competitively across several tasks, indicating that MLMs do not fundamentally rely on syntactic order (a minimal sketch of this permutation step follows the list).
  2. Performance on Downstream Tasks: Models trained on permuted data only slightly underperform compared to those trained on natural data. These findings are consistent across tasks such as QQP, SST-2, MRPC, and MNLI, where the permuted pre-training models nearly match the performance of the naturally pre-trained models.
  3. Role of Distributional Information: The experiments indicate that word co-occurrence statistics account for much of MLMs' success: as long as the pre-training data preserves this distributional information, models largely retain their performance on downstream evaluations even when natural word order is destroyed.
  4. Parametric and Non-parametric Probing: The paper further evaluates the models with parametric probes (dependency parsing and SentEval probing tasks) and non-parametric probes. Parametric probes still report substantial syntactic sensitivity for the shuffled-order models, whereas non-parametric probes expose clear deficiencies, suggesting that the models do not effectively recover natural word order and that parametric probes may overstate the syntactic information in the representations (see the probe sketch after this list).
  5. Implications for NLP Benchmarks: The authors argue that evaluation standards and benchmarks in NLP need to be revisited, since current setups may not adequately test the linguistic capabilities they are presumed to measure.
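
To make the permutation setup concrete, the following is a minimal Python sketch of sentence-level word shuffling of the kind used to build the permuted pre-training corpora. The helper names and naive whitespace tokenization are illustrative assumptions, not the authors' exact pipeline, which operates on tokenized text and also studies less destructive n-gram permutations.

```python
import random

def shuffle_sentence(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its words in random order.

    Whitespace tokenization is an assumption for illustration; the paper
    permutes tokenized sentences and also trains variants that keep
    local n-grams intact.
    """
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

def build_shuffled_corpus(sentences, seed: int = 0):
    """Shuffle word order independently within each sentence of a corpus."""
    rng = random.Random(seed)
    return [shuffle_sentence(s, rng) for s in sentences]

# The permuted corpus preserves which words co-occur within a sentence
# (the distributional signal) while destroying natural word order.
corpus = ["the cat sat on the mat", "word order matters little for pre-training"]
print(build_shuffled_corpus(corpus))
```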
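
Because the probing results hinge on the distinction between parametric and non-parametric probes, the sketch below shows what a parametric probe amounts to: a classifier with its own trainable parameters fit on frozen sentence embeddings to predict a linguistic property, in the spirit of SentEval-style tasks. The data and helper names are hypothetical placeholders; the takeaway is that probe capacity, and not only the representation, can contribute to high probe accuracy, which is consistent with the deficiencies the authors point to.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_parametric_probe(train_X, train_y, test_X, test_y) -> float:
    """Fit a linear probe on frozen embeddings and return test accuracy."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_X, train_y)
    return probe.score(test_X, test_y)

# Hypothetical frozen 768-dim sentence embeddings with binary labels,
# standing in for a SentEval-style probing task.
rng = np.random.default_rng(0)
train_X, test_X = rng.normal(size=(1000, 768)), rng.normal(size=(200, 768))
train_y, test_y = rng.integers(0, 2, size=1000), rng.integers(0, 2, size=200)
print(f"probe accuracy: {run_parametric_probe(train_X, train_y, test_X, test_y):.3f}")
```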

Implications

This paper sheds light on the over-attribution of linguistic proficiency to masked language models and calls current benchmarks into question. The research implies that much of what is interpreted as linguistic understanding may simply be the models' adeptness at recognizing statistical patterns in data. This insight could motivate more demanding benchmarks that truly challenge models' syntactic and semantic reasoning capabilities.

Future Directions

The findings encourage the exploration of multiple avenues in AI and NLP:

  • Cross-linguistic Studies: Investigating whether the diminished importance of word order holds in languages with different syntactic and morphological structures could provide more comprehensive insights.
  • Impact on Generative Models: Evaluating whether pre-training on unnatural word order affects generative tasks such as machine translation or text summarization differently than it affects classification tasks.
  • Development of More Rigorous Tasks: Creating novel benchmarks and datasets that require models to apply complex syntactic and semantic rules to arrive at correct answers, moving beyond simple leveraging of co-occurrence statistics.

In conclusion, this paper highlights the need to understand the fundamental underpinnings of language models. Progress hinges on our ability to design evaluation mechanisms that uniquely probe and challenge different facets of linguistic competence. The insights from this paper could help steer transformer-based models toward more genuinely human-like language understanding.