PERT: Pre-training BERT with Permuted Language Model

Published 14 Mar 2022 in cs.CL | (2203.06906v1)

Abstract: Pre-trained Language Models (PLMs) have been widely used in various NLP tasks, owing to their powerful text representations trained on large-scale corpora. In this paper, we propose a new PLM called PERT for natural language understanding (NLU). PERT is an auto-encoding model (like BERT) trained with Permuted Language Model (PerLM). The formulation of the proposed PerLM is straightforward. We permute a proportion of the input text, and the training objective is to predict the position of the original token. Moreover, we also apply whole word masking and N-gram masking to improve the performance of PERT. We carried out extensive experiments on both Chinese and English NLU benchmarks. The experimental results show that PERT can bring improvements over various comparable baselines on some of the tasks, while others are not. These results indicate that developing more diverse pre-training tasks is possible instead of masked language model variants. Several quantitative studies are carried out to better understand PERT, which might help design PLMs in the future. Resources are available: https://github.com/ymcui/PERT

Authors (3)

Citations (32)

Summary

PERT replaces masked language modeling with a Permuted Language Model (PerLM) objective: a proportion of the input text is permuted, and the model predicts the position of each original token.
Whole word masking and N-gram masking are applied alongside PerLM to improve performance.
PERT improves on machine reading comprehension and named entity recognition but underperforms on text classification, which points to task-specific trade-offs.

Analyzing PERT: Pre-training BERT with Permuted Language Model

The paper proposes a pre-trained language model (PLM) called PERT, which targets natural language understanding (NLU) and is trained with a Permuted Language Model (PerLM) objective. The authors depart from the conventional Masked Language Model (MLM) task used in models like BERT and instead explore a pre-training task that predicts the original position of tokens in a permuted text sequence. The aim is to show that pre-training tasks other than MLM variants are viable, and so to broaden the range of pre-training tasks available for PLMs.

Methodology

PERT is an auto-encoding model similar to BERT, and PerLM is its primary pre-training task. During training, a proportion of the input text is permuted, and the model is trained to predict the position of each original token. Whole word masking and N-gram masking are applied as well, which the authors report improves PERT's performance.
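The permutation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `make_perlm_example`, the default `permute_ratio` of 0.15, the `IGNORE_INDEX` sentinel, and the label convention (each selected position points to the index where its original token now sits) are all hypothetical choices made for this example. It selects positions uniformly at random and does not model whole word masking or N-gram masking.

```python
import random

IGNORE_INDEX = -100  # hypothetical sentinel for positions that carry no label


def make_perlm_example(tokens, permute_ratio=0.15, seed=None):
    """Permute a proportion of `tokens` and build position labels.

    Returns (permuted, labels). For every selected position i, labels[i] is
    the index in `permuted` where the original token tokens[i] now sits.
    Unselected positions keep their token and get IGNORE_INDEX.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # Select at least two positions (when available) so a swap is possible.
    k = min(n, max(2, round(n * permute_ratio)))
    selected = sorted(rng.sample(range(n), k))
    sources = selected[:]
    rng.shuffle(sources)

    permuted = list(tokens)
    labels = [IGNORE_INDEX] * n
    for dst, src in zip(selected, sources):
        permuted[dst] = tokens[src]  # the token from position src moves to dst
        labels[src] = dst            # so position src should point at dst
    return permuted, labels


if __name__ == "__main__":
    original = "studies show that word order does not always hinder reading".split()
    permuted, labels = make_perlm_example(original, permute_ratio=0.3, seed=0)
    print(permuted)
    print(labels)
```

Because the shuffle is random, a selected token can land back in its own position; in that case its label is simply its own index. A model trained on such examples would score every input position for each labeled position and be penalized when the highest score is not at the labeled index.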

The authors conducted extensive experiments on both Chinese and English NLU tasks, covering machine reading comprehension (MRC), text classification (TC), and named entity recognition (NER). The results suggest that PERT shows notable improvements on certain tasks, particularly in MRC and NER, yet it does not uniformly outperform MLM-based models across all NLU tasks, notably lagging in text classification.

Results and Discussion

Machine Reading Comprehension: PERT improved over the baselines, which suggests that the position-prediction objective strengthens contextual understanding.
Text Classification: PERT underperformed MLM-based models, indicating that permutation-based pre-training may hinder sentence-level categorization tasks.
Named Entity Recognition: PERT showed consistent improvements, likely benefiting from the emphasis on sequence structure inherent in PerLM.

The model’s contrasting performance across tasks implies that while permutation can enhance contextual inference, it may disrupt semantic interpretation vital to simpler sentence-level tasks like TC.

Implications and Future Directions

This exploration of PerLM has implications for the design of future PLMs. The authors provide evidence that pre-training tasks other than MLM can offer advantages on certain tasks. However, the mixed results underscore the need for further experimentation with the diversity of pre-training tasks and the granularity of permutation.

Future research could focus on refining permutation strategies, such as adjusting their granularity or incorporating hybrid models that balance permutation with token prediction, to address the specific limitations observed in text classification. Additionally, investigating the cognitive parallels between human reading of permuted text and model interpretation might offer novel insights for linguistic representation in AI.

Overall, by questioning the established paradigm of language model pre-training, PERT opens a discussion on the need for, and the potential of, diverse pre-training tasks tailored to specific linguistic challenges.