Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods (1804.06876v1)

Published 18 Apr 2018 in cs.CL and cs.AI

Abstract: We introduce a new benchmark, WinoBias, for coreference resolution focused on gender bias. Our corpus contains Winograd-schema style sentences with entities corresponding to people referred by their occupation (e.g. the nurse, the doctor, the carpenter). We demonstrate that a rule-based, a feature-rich, and a neural coreference system all link gendered pronouns to pro-stereotypical entities with higher accuracy than anti-stereotypical entities, by an average difference of 21.1 in F1 score. Finally, we demonstrate a data-augmentation approach that, in combination with existing word-embedding debiasing techniques, removes the bias demonstrated by these systems in WinoBias without significantly affecting their performance on existing coreference benchmark datasets. Our dataset and code are available at http://winobias.org.

Citations (846)

Summary

  • The paper introduces the WinoBias benchmark, which exposes an average 21.1 F1 gap between pro-stereotypical and anti-stereotypical pronoun resolutions.
  • It mitigates this gender bias through data augmentation via gender swapping combined with debiased word embeddings.
  • The debiased systems retain competitive performance on the OntoNotes 5.0 benchmark.

Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

Introduction

The paper "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods" (1804.06876) addresses the critical issue of gender bias in NLP systems, specifically within the domain of coreference resolution. Coreference resolution aims to determine which words in a sentence refer to the same entity. However, existing methods often inadvertently perpetuate societal stereotypes present in their training data. This paper introduces the WinoBias benchmark, a novel corpus designed to expose gender bias in coreference resolution systems by utilizing sentences structured in a Winograd-schema style that require linking gendered pronouns to occupational labels. Figure 1

Figure 1: Pairs of gender-balanced co-reference tests in the WinoBias dataset showing pro-stereotypical and anti-stereotypical scenarios.
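To make the test format concrete, the sketch below shows what such pro/anti pairs look like. The sentences are paraphrased in the dataset's style and the field names are illustrative, not the released corpus format; see http://winobias.org for the actual data.

```python
# Illustrative WinoBias-style test pairs (paraphrased; field names are
# hypothetical). Each pair differs only in the gender of the pronoun,
# so any accuracy gap between "pro" and "anti" sentences must come
# from gender cues alone.
test_pairs = [
    {
        "pro":  "The physician hired the secretary because he was overwhelmed with clients.",
        "anti": "The physician hired the secretary because she was overwhelmed with clients.",
        "antecedent": "the physician",
    },
    {
        "pro":  "The nurse reminded the patient that her shift ended in an hour.",
        "anti": "The nurse reminded the patient that his shift ended in an hour.",
        "antecedent": "the nurse",
    },
]
```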

Evaluating Gender Bias in Coreference Systems

The core of this research is the WinoBias dataset, which is crafted to reveal the predispositions of coreference systems towards gender-based stereotypes. The dataset comprises sentence pairs in which entities are denoted by occupation and referenced by gendered pronouns, with each pronoun resolution being either stereotypical or anti-stereotypical for that occupation. The paper evaluates three coreference systems: the rule-based Stanford Deterministic Coreference System, the feature-rich Berkeley Coreference Resolution System, and the UW End-to-end Neural Coreference Resolution System. All three demonstrate a pronounced bias, with an average gap of 21.1 F1 between pro-stereotypical and anti-stereotypical roles.
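A minimal sketch of how such a gap is measured: score the system separately on the pro- and anti-stereotypical halves of the benchmark and take the F1 difference. The per-system scores below are placeholders, not the paper's reported numbers.

```python
# Minimal bias-gap computation over WinoBias-style splits.
# Scores are hypothetical placeholders, not the paper's results.

def bias_gap(pro_f1: float, anti_f1: float) -> float:
    """F1 on pro-stereotypical minus F1 on anti-stereotypical sentences.
    An unbiased system scores near zero; a positive gap means the system
    leans on gender stereotypes to resolve pronouns."""
    return pro_f1 - anti_f1

# Hypothetical (pro_f1, anti_f1) pairs for three systems:
scores = {
    "rule-based":   (70.0, 50.0),
    "feature-rich": (68.0, 52.0),
    "neural":       (72.0, 49.0),
}

for name, (pro, anti) in scores.items():
    print(f"{name}: gap = {bias_gap(pro, anti):+.1f} F1")
```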

Mitigation Strategies for Gender Bias

To combat the entrenched bias in coreference systems, the authors propose a debiasing strategy that combines data augmentation with word-embedding debiasing. The training corpus is augmented with a gender-swapped copy of the data (e.g., replacing "he" with "she" and vice versa), and the models are trained with debiased word embeddings. Together these steps remove the measured bias on WinoBias, equalizing performance across the stereotypical and anti-stereotypical conditions without degrading overall performance on standard coreference benchmarks such as OntoNotes 5.0.
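The sketch below illustrates both ingredients under simplifying assumptions: a tiny swap dictionary (the paper's pipeline uses a fuller word list and also anonymizes named entities before swapping) and a hard-debiasing-style projection that removes a precomputed gender direction from word vectors. All names and the example sentence are illustrative.

```python
import numpy as np

# Two mitigation ingredients in miniature. GENDER_SWAPS is a tiny
# illustrative subset; a real pipeline uses a fuller dictionary and
# disambiguates forms like "her" (him vs. his) with part-of-speech tags.
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her", "hers": "his",
    "man": "woman", "woman": "man",
}

def gender_swap(tokens):
    """Flip gendered words, preserving capitalization. Training on the
    union of the original and swapped corpora balances the gender
    statistics the model sees for each occupation."""
    out = []
    for tok in tokens:
        swap = GENDER_SWAPS.get(tok.lower())
        if swap is None:
            out.append(tok)
        else:
            out.append(swap.capitalize() if tok[:1].isupper() else swap)
    return out

def neutralize(vec, gender_dir):
    """Project the (unit-norm) gender direction out of a word vector,
    in the spirit of hard word-embedding debiasing."""
    return vec - np.dot(vec, gender_dir) * gender_dir

# Usage: augment a sentence, then train on original + swapped copies.
sentence = "The doctor asked the nurse whether she could help".split()
print(" ".join(gender_swap(sentence)))
# -> The doctor asked the nurse whether he could help
```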

Practical and Theoretical Implications

The successful reduction of gender bias without a performance penalty has several implications. Practically, it helps ensure that NLP systems behave fairly across diverse user demographics. Theoretically, it advances our understanding of how societal biases in training data propagate into machine learning models and paves the way for more ethical AI systems. Future research can extend these methods, exploring debiasing opportunities across other NLP tasks and datasets.

Conclusion

The research establishes the prevalence of gender bias within mainstream coreference resolution systems and offers practical methodologies for mitigating it. Through the development of the WinoBias benchmark and the accompanying debiasing techniques, the paper contributes substantively to the ongoing discourse surrounding ethical AI. The findings underscore the need for continued scrutiny and refinement of NLP systems to ensure equitable treatment across demographic groups.
