Masked Language Model Scoring

Published 31 Oct 2019 in cs.CL, cs.LG, eess.AS, and stat.ML | (1910.14659v3)

Abstract: Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL's unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rescore translations in multiple languages. We release our library for language model scoring at https://github.com/awslabs/mlm-scoring.




Summary

The paper introduces PLLs as an alternative scoring method that outperforms autoregressive models in evaluating sentence fluency and linguistic acceptability.
The paper shows that rescoring ASR and NMT hypotheses with PLLs yields a 30% relative reduction in WER on LibriSpeech and up to +1.7 BLEU on low-resource NMT pairs.
The paper illustrates PLLs' versatility by rescoring translations in multiple languages with a single cross-lingual model and by making unsupervised acceptability judgments, where PLLs improve on GPT-2 by about 10 points on specific phenomena.

Analysis of "Masked Language Model Scoring"

The paper "Masked Language Model Scoring" explores the use of pseudo-log-likelihood scores (PLLs) from masked language models (MLMs) for evaluating and improving NLP systems. The authors propose PLLs as an alternative to scores from autoregressive language models such as GPT-2 and demonstrate their advantages across several NLP applications.

Overview

Masked language models such as BERT and RoBERTa typically require finetuning before they can be applied to a specific NLP task. This paper instead evaluates them out of the box using PLLs, which are computed by masking each token of a sentence in turn and summing the log probabilities the model assigns to the masked tokens. With this score, an unmodified MLM can rescore the hypotheses produced by automatic speech recognition (ASR) and neural machine translation (NMT) systems. The authors show that PLLs outperform GPT-2 scores in these rescoring settings, improving word error rate (WER) and BLEU score.
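The masking procedure can be sketched in a few lines. This is a minimal illustration, not the paper's library: the callable `masked_log_prob` is a hypothetical stand-in for a real MLM forward pass, the `MASK` symbol is a placeholder, and the toy uniform model exists only so the example runs without downloading a model.

```python
import math
from typing import Callable, Sequence

MASK = "[MASK]"  # placeholder mask symbol; real models define their own


def pseudo_log_likelihood(
    tokens: Sequence[str],
    masked_log_prob: Callable[[Sequence[str], int, str], float],
) -> float:
    """Sum of log P(token_t | sentence with position t masked) over all t."""
    total = 0.0
    for t, target in enumerate(tokens):
        masked = list(tokens)
        masked[t] = MASK  # hide only position t; left and right context stay visible
        total += masked_log_prob(masked, t, target)
    return total


def toy_uniform_log_prob(masked, position, target, vocab_size=4):
    """Hypothetical model: every token is equally likely regardless of context."""
    return math.log(1.0 / vocab_size)


pll = pseudo_log_likelihood(["the", "cat", "sat"], toy_uniform_log_prob)
print(pll)  # three tokens, each contributing log(1/4)
```

Note that the loop calls the model once per token, which is the cost that the single-pass, maskless variant mentioned in the abstract is meant to avoid.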

Numerical Results

The paper reports the following gains from PLLs:

Rescoring with RoBERTa reduces the WER of an end-to-end LibriSpeech model by 30% relative and adds up to +1.7 BLEU on low-resource NMT pairs, with further gains from domain adaptation.
PLLs also support unsupervised linguistic acceptability judgments, improving on GPT-2 by about 10 points on specific BLiMP phenomena such as island effects and negative polarity item (NPI) licensing.
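The results above mix relative and absolute figures: the WER gain is a relative reduction, while the BLEU and acceptability gains are absolute point differences. A small worked example of the distinction follows; the WER values are hypothetical and chosen only to make the arithmetic easy to follow, not taken from the paper.

```python
def relative_reduction(baseline: float, rescored: float) -> float:
    """Fraction of the baseline error removed by rescoring."""
    return (baseline - rescored) / baseline


# Hypothetical WERs for illustration only; not figures from the paper.
baseline_wer = 10.0
rescored_wer = 7.0

absolute_drop = baseline_wer - rescored_wer  # 3.0 points absolute
relative_drop = relative_reduction(baseline_wer, rescored_wer)

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.0%}")
```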

Key Contributions and Implications

PLLs as Evaluation Metrics: PLLs score sentence fluency without the left-to-right bias inherent in autoregressive models, because each token is predicted from both its left and right context. This supports more accurate fluency assessment and unsupervised linguistic acceptability judgments.
Applications in Rescoring: Rescoring ASR and NMT hypotheses with PLLs gives a clear practical benefit, improving systems that are already strong baselines. This suggests broader potential for MLMs in tasks that have traditionally relied on left-to-right language models.
Efficient Scoring Techniques: Finetuning an MLM to produce scores without masking allows computation in a single inference pass, rather than one pass per masked token.
Multilingual and Cross-domain Use: The authors use a single cross-lingual model to rescore translations in multiple languages, and report further gains from domain adaptation.
Pseudo-perplexity (PPPL): Derived from PLLs, PPPL is an intrinsic evaluation metric for MLMs at the sentence and corpus level, analogous to perplexity for conventional language models.
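Two of the ideas in this list can be sketched together. The sketch assumes that pseudo-perplexity follows the usual perplexity form, with summed PLLs normalized by total token count, and that rescoring adds a weighted PLL to the base system's log score. The weight `lam`, the tuple layout, and all numbers in the n-best list are hypothetical.

```python
import math
from typing import List, Tuple


def pseudo_perplexity(plls: List[float], token_counts: List[int]) -> float:
    """exp(-(sum of sentence PLLs) / (total tokens)), mirroring ordinary perplexity."""
    return math.exp(-sum(plls) / sum(token_counts))


def rescore(hypotheses: List[Tuple[str, float, float]], lam: float) -> str:
    """Return the hypothesis text maximizing base_score + lam * pll.

    Each entry is (text, base_model_log_score, pll); lam is a tunable weight.
    """
    return max(hypotheses, key=lambda h: h[1] + lam * h[2])[0]


# Hypothetical 2-best list: (text, base log score, PLL from an MLM).
nbest = [
    ("i scream for ice cream", -4.0, -12.0),
    ("ice cream for ice cream", -3.8, -15.0),
]

print(rescore(nbest, lam=0.0))  # base model alone prefers the second hypothesis
print(rescore(nbest, lam=0.5))  # adding the PLL term flips the choice to the first
print(pseudo_perplexity([-12.0, -15.0], [5, 5]))
```

In practice the weight would be tuned on held-out data; a lower PPPL indicates that the MLM finds the corpus more predictable.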

Theoretical and Practical Implications

This research provides a foundation for using masked models in scoring tasks, which have so far been dominated by left-to-right models. The reported improvements, particularly in multilingual settings and after domain adaptation, open the door to more versatile applications of MLMs. As language models continue to evolve, the methods and findings of this work could also inform future model architectures and evaluation paradigms.

Future Directions

While the paper shows promising results, there are areas for further exploration, such as improving maskless scoring methods and extending PLLs to a broader and more diverse set of NLP tasks. Investigating how PLLs can be combined with scores from other language models could also yield further improvements.

In conclusion, this paper makes a substantial contribution to NLP by demonstrating the utility of PLLs from MLMs, challenging the default use of autoregressive models for scoring and setting a precedent for future research on efficient and effective use of pretrained language models.
