- The paper demonstrates that Perturbed Masking recovers the syntactic structure latent in BERT's representations without adding any extra parameters.
- It utilizes impact matrices from token perturbation to uncover dependency and constituency structures in unsupervised parsing tasks.
- Findings show that the syntax emerging from BERT can support downstream applications such as Aspect-Based Sentiment Classification as well as, and sometimes better than, parser-produced trees.
Detailed Analysis of "Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT"
Introduction
The paper "Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT" addresses the challenges in probing pre-trained models like BERT for linguistic capabilities without introducing additional parameters that may obscure the true capabilities of the model. By removing the need for parameterized probes, the researchers focus on the linguistic structures inherently learned by BERT during its pre-training phase. This approach highlights a more intrinsic view of the syntactic properties captured by the BERT representations.
Parameter-free Probing with Perturbed Masking
The proposed method, Perturbed Masking, measures word-to-word interactions through a two-stage perturbation built on BERT's Masked Language Modeling (MLM) input format. First, a token x_i is replaced with [MASK] and its contextual representation H(x\{x_i})_i is recorded; second, a further token x_j is also masked, yielding H(x\{x_i, x_j})_i. The impact of x_j on x_i is then f(x_i, x_j) = d(H(x\{x_i})_i, H(x\{x_i, x_j})_i), where d is the Euclidean distance between the two representations. Repeating this for every token pair produces an impact matrix F that can be analyzed to infer syntactic structures without any external labeled data or the additional parameters typically involved in probing classifiers.
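To make the procedure concrete, here is a minimal sketch of the two-stage perturbation, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (illustrative choices, not the paper's released code). For simplicity it treats every WordPiece token as a unit, whereas the paper aggregates subword pieces into words:

```python
import torch
from transformers import BertModel, BertTokenizer

def impact_matrix(sentence, model_name="bert-base-uncased"):
    """Two-stage perturbed masking: mask x_i, then additionally mask x_j,
    and measure how far x_i's contextual representation moves."""
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    n = ids.size(0)                      # includes [CLS] and [SEP]
    mask_id = tokenizer.mask_token_id
    F = torch.zeros(n, n)

    with torch.no_grad():
        for i in range(1, n - 1):        # skip the special tokens
            masked_i = ids.clone()
            masked_i[i] = mask_id        # stage 1: mask x_i
            h_i = model(input_ids=masked_i.unsqueeze(0)).last_hidden_state[0, i]
            for j in range(1, n - 1):
                if j == i:
                    continue
                masked_ij = masked_i.clone()
                masked_ij[j] = mask_id   # stage 2: additionally mask x_j
                h_ij = model(input_ids=masked_ij.unsqueeze(0)).last_hidden_state[0, i]
                F[i, j] = torch.dist(h_i, h_ij)  # Euclidean distance

    return F[1:-1, 1:-1]                 # drop [CLS]/[SEP] rows and columns
```

As written this runs O(n^2) forward passes per sentence; batching the inner loop is the obvious optimization in practice.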
Figure 1: Heatmap of the impact matrix for the sentence "For those who follow social media transitions on Capitol Hill, this will be a little different."
The impact matrix is effectively a pairwise comparison of token influence, and visualizations of these matrices (as heatmaps) reveal syntactic relationships such as dependency and constituency patterns (Figure 1).
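Rendering such a matrix as a heatmap takes only a few lines of matplotlib; the sketch below assumes F and the token list come from the impact_matrix function above:

```python
import matplotlib.pyplot as plt

def plot_impact(F, tokens):
    """Visualize an impact matrix: cell (i, j) shows how strongly
    masking token j perturbs the representation of token i."""
    fig, ax = plt.subplots()
    im = ax.imshow(F, cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    plt.show()
```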
Dependency and Constituency Probing
By treating the impact matrices derived from BERT as indicative of syntactic structure, the paper evaluates BERT's ability to support unsupervised parsing. Two settings are considered: dependency parsing, where tree structures are decoded from the impact matrix with graph-based algorithms, and constituency parsing, modeled as a recursive top-down process that splits a sentence into constituents.
Dependency Parsing
The Eisner algorithm and the Chu-Liu/Edmonds algorithm are used to extract projective and non-projective dependency trees, respectively, from the impact matrices. The results show that Perturbed Masking lets BERT outperform linguistically uninformed baselines in recovering known dependency structures, although accuracy depends on which annotation scheme serves as the reference. A minimal non-projective decoder is sketched below.
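For the non-projective case, the maximum spanning arborescence of the impact graph plays the role of the dependency tree. Here is a sketch using networkx's implementation of Chu-Liu/Edmonds; the fixed root index and the edge-direction convention are simplifying assumptions of this sketch, not details taken from the paper:

```python
import networkx as nx
from networkx.algorithms.tree.branchings import maximum_spanning_arborescence

def dependency_tree(F, root=0):
    """Decode a non-projective dependency tree from an n x n impact
    matrix F via the maximum spanning arborescence (Chu-Liu/Edmonds).
    Treating F[i][j] as the weight of a head -> dependent arc from
    word i to word j is a heuristic convention of this sketch."""
    n = len(F)
    G = nx.DiGraph()
    for i in range(n):
        for j in range(n):
            if i != j and j != root:   # no arc may enter the root
                G.add_edge(i, j, weight=float(F[i][j]))
    tree = maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())        # (head, dependent) pairs
```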
Constituency Parsing
A top-down parsing algorithm named MART (MAtRix-based Top-down parser) is proposed; it recursively identifies the most plausible constituency break points using the impact matrices. Compared against existing unsupervised parsers such as ON-LSTM, BERT proves competitive, and is particularly strong at capturing higher-level syntactic units such as clause-level constituents. A simplified sketch of the splitting idea follows.
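The core of the top-down strategy is choosing, for each span, the split point that keeps strongly interacting tokens on the same side. The sketch below captures that idea with a simple intra-minus-inter-span score over the impact matrix; the paper's exact MART scoring function differs in its details:

```python
def _avg(pairs, F):
    """Mean impact over a list of (row, col) index pairs; 0 if empty."""
    return sum(F[a][b] for a, b in pairs) / len(pairs) if pairs else 0.0

def split_span(F, i, j):
    """Recursively split the word span [i, j] into a binary tree of
    nested tuples, choosing the split k that maximizes average impact
    within the two sub-spans minus the average impact across them."""
    if i == j:
        return i                           # single-word span: a leaf
    best_k, best_score = i, float("-inf")
    for k in range(i, j):                  # candidate split: [i,k] | [k+1,j]
        intra = [(a, b) for a in range(i, k + 1)
                 for b in range(i, k + 1) if a != b]
        intra += [(a, b) for a in range(k + 1, j + 1)
                  for b in range(k + 1, j + 1) if a != b]
        cross = [(a, b) for a in range(i, k + 1)
                 for b in range(k + 1, j + 1)]
        cross = cross + [(b, a) for (a, b) in cross]   # both directions
        score = _avg(intra, F) - _avg(cross, F)
        if score > best_score:
            best_k, best_score = k, score
    return (split_span(F, i, best_k), split_span(F, best_k + 1, j))
```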
Evaluation on Downstream Tasks
The study further explores the practical utility of BERT's internalized syntax by evaluating its impact on Aspect-Based Sentiment Classification (ABSC). Trees extracted from BERT are compared with trees produced by supervised parsers trained on human-designed annotation schemes, and the BERT-derived trees sometimes support downstream performance equally well or better. This lends empirical weight to BERT's emergent linguistic structures, even though they do not align perfectly with traditional syntactic formalisms.
Conclusion
The study provides evidence that BERT encodes rich syntactic information that can be effectively extracted with the Perturbed Masking method. Although the induced structures do not always agree with linguist-defined formalisms, they demonstrate empirical utility in downstream NLP tasks. Future work might extend perturbed masking to other linguistic phenomena or refine the techniques for deriving specific syntactic insights directly from impact matrices.