BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation (2210.07626v1)
Abstract: Automatic evaluation metrics are crucial to the development of generative systems. In recent years, pre-trained language model (PLM) based metrics, such as BERTScore, have been widely adopted across generation tasks. However, PLMs have been shown to encode a range of stereotypical societal biases, raising concerns about the fairness of PLM-based metrics. To this end, this work presents the first systematic study of social bias in PLM-based metrics. We demonstrate that popular PLM-based metrics exhibit significantly higher social bias than traditional metrics across 6 sensitive attributes: race, gender, religion, physical appearance, age, and socioeconomic status. In-depth analysis suggests that the choice of metric paradigm (matching, regression, or generation) has a greater impact on fairness than the choice of PLM. In addition, we develop debiasing adapters that are injected into PLM layers, mitigating bias in PLM-based metrics while retaining high performance for evaluating text generation.
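As an illustration of the kind of bias probe the abstract describes, the sketch below scores two candidate sentences that differ only in a gendered pronoun against the same reference and compares the resulting BERTScore values. This is a minimal, hypothetical example using the third-party `bert-score` package, not the paper's own evaluation protocol or datasets, which cover paired examples across all six sensitive attributes.

```python
# Sketch (not from the paper): probing a PLM-based metric for social bias by
# comparing scores on candidates that differ only in a sensitive attribute.
# Assumes the `bert-score` package is installed (pip install bert-score).
from bert_score import score

reference = ["The doctor finished the shift and went home."]

# Two candidates identical except for the gendered pronoun.
candidate_a = ["He finished the shift and went home."]
candidate_b = ["She finished the shift and went home."]

_, _, f1_a = score(candidate_a, reference, lang="en")
_, _, f1_b = score(candidate_b, reference, lang="en")

# A fair metric should score both candidates (nearly) identically;
# a consistent gap across many such pairs is evidence of encoded social bias.
print(f"F1 (he):  {f1_a.item():.4f}")
print(f"F1 (she): {f1_b.item():.4f}")
print(f"gap:      {abs(f1_a.item() - f1_b.item()):.4f}")
```

A single pair proves little on its own; a study like the one described would aggregate such score gaps over many templated pairs per attribute to quantify bias.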