
A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models

(2403.12025)
Published Mar 18, 2024 in cs.CY, cs.CL, and cs.LG

Abstract

LLMs hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.

An iterative, participatory approach is used to design assessment rubrics and to introduce the EquityMedQA datasets for testing health equity biases in LLMs.

Overview

  • The paper introduces a framework and datasets designed to identify and evaluate biases and health equity issues in LLMs used in healthcare.

  • An iterative, participatory approach led to the creation of multifactorial assessment rubrics for bias evaluation, alongside an empirical case study with Med-PaLM 2.

  • EquityMedQA datasets, comprising seven distinct sets of medical queries, aim to facilitate the adversarial testing of health equity concerns in medical LLMs.

  • Key findings from the empirical study demonstrate the presence of biases in LLM outputs, the importance of diverse rater perspectives, and the utility of counterfactual analysis in uncovering subtle biases.

Introducing a Framework and Datasets for Evaluating Health Equity Harms in LLMs

Overview of Proposed Framework and Datasets

The utilization of LLMs in healthcare has demonstrated considerable potential in enhancing access to medical information and improving patient care. However, alongside the opportunities, there exist significant challenges, particularly concerning the perpetuation of biases and exacerbation of health disparities. Addressing these challenges requires a systematic approach to evaluate and identify biases embedded within LLM-generated content. In this context, the paper presents a comprehensive framework alongside a collection of newly-released datasets aimed at surfacing biases related to health equity in the outputs of medical LLMs. This effort, grounded in an iterative and participatory approach, encompasses multifactorial assessment rubrics for bias evaluation and an empirical case study with Med-PaLM 2, contributing valuable insights into the identification and mitigation of equity-related harms in LLMs.

Multifactorial Assessment Rubrics

The assessment rubrics detailed in this paper were designed to evaluate bias in LLM-generated answers to medical queries. They incorporate dimensions of bias developed in collaboration with equity experts, reflecting a nuanced approach to understanding bias beyond conventional metrics. Three types of rubrics are introduced (a minimal data-structure sketch follows the list):

  • Independent Assessment: Evaluates bias in a single answer to a question, allowing raters to identify various forms of bias including inaccuracies across identity axes, lack of inclusivity, and stereotyping.
  • Pairwise Assessment: Compares the presence or degree of bias between two answers to a single question, providing a relative measure of bias between model outputs.
  • Counterfactual Assessment: Focuses on answers to pairs of questions that differ only by identifiers of demographics or other context, helping identify biases introduced by changes in the specified identities or contexts.
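To make the distinctions between the three rubrics concrete, here is a minimal Python sketch of how rating tasks under each rubric might be represented. The field names and the bias dimensions listed are illustrative assumptions, not the paper's exact rubric items, which are more detailed.

```python
from dataclasses import dataclass, field

# Illustrative bias dimensions; the paper's rubric items are more granular.
BIAS_DIMENSIONS = [
    "inaccuracy_for_identity_group",
    "lack_of_inclusivity",
    "stereotyping",
]

@dataclass
class IndependentAssessment:
    """Rate a single answer to a single question for presence of bias."""
    question: str
    answer: str
    # Dimensions of bias the rater flagged, drawn from BIAS_DIMENSIONS.
    flagged_dimensions: list = field(default_factory=list)

@dataclass
class PairwiseAssessment:
    """Compare two answers to the same question for relative bias."""
    question: str
    answer_a: str
    answer_b: str
    # "a", "b", or "tie": which answer shows more (or equal) bias.
    more_biased: str = "tie"

@dataclass
class CounterfactualAssessment:
    """Rate answers to two questions differing only in identity or context."""
    question_original: str
    question_counterfactual: str
    answer_original: str
    answer_counterfactual: str
    # Whether differences between the two answers appear unjustified.
    unjustified_difference: bool = False
```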

EquityMedQA Datasets

EquityMedQA comprises seven datasets designed to facilitate adversarial testing of health equity issues in medical LLMs. These datasets span various aspects of medical information queries, from explicitly adversarial questions to inquiries enriched for content related to known health disparities. The diversity of collection methodologies, including human curation, LLM-generated queries, and a focus on global health topics, underscores the comprehensive nature of these datasets in targeting different forms of potential bias. The datasets are as follows (a loading sketch follows the list):

  • OMAQ: Features human-curated, explicitly adversarial queries across multiple health topics.
  • EHAI: Targets implicitly adversarial queries related to health disparities in the United States.
  • FBRT-Manual and FBRT-LLM: Contain questions derived through failure-based red teaming of Med-PaLM 2.
  • TRINDS: Centers on tropical and infectious diseases, emphasizing the global context.
  • CC-Manual and CC-LLM: Include counterfactual query pairs that adjust identity or context, supporting a deeper understanding of how such changes affect model answers.
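As a rough illustration of how one might organize these datasets for an evaluation run, the sketch below groups them by name and collects model answers for rating. The file names and single-column CSV format are hypothetical; the released datasets may use a different schema.

```python
import csv

# Hypothetical file paths; the released EquityMedQA data may be packaged differently.
EQUITYMEDQA_FILES = {
    "OMAQ": "omaq.csv",                # human-curated, explicitly adversarial
    "EHAI": "ehai.csv",                # implicitly adversarial, US health disparities
    "FBRT-Manual": "fbrt_manual.csv",  # failure-based red teaming (manual)
    "FBRT-LLM": "fbrt_llm.csv",        # failure-based red teaming (LLM-generated)
    "TRINDS": "trinds.csv",            # tropical and infectious diseases
    "CC-Manual": "cc_manual.csv",      # counterfactual pairs (manual)
    "CC-LLM": "cc_llm.csv",            # counterfactual pairs (LLM-generated)
}

def load_questions(path):
    """Read one question per row from a CSV with a 'question' column (assumed format)."""
    with open(path, newline="") as f:
        return [row["question"] for row in csv.DictReader(f)]

def collect_answers(model_fn):
    """Query a model (a function from question text to answer text) on every
    EquityMedQA dataset and return (question, answer) pairs keyed by dataset."""
    answers = {}
    for name, path in EQUITYMEDQA_FILES.items():
        answers[name] = [(q, model_fn(q)) for q in load_questions(path)]
    return answers
```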

Empirical Results and Implications

Through an extensive empirical study utilizing the developed rubrics and datasets, several key findings emerged:

  • Bias in LLM Outputs: The study revealed biases within Med-PaLM 2 outputs across multiple dimensions, indicating the necessity of diverse methodologies in bias evaluation.
  • Role of Rater Groups: Variation in bias reporting between physician, health equity expert, and consumer rater groups highlighted the importance of including diverse perspectives in bias evaluation efforts.
  • Utility of Counterfactual Analysis: The counterfactual assessment rubric surfaced biases tied to changes in demographic identifiers or context, offering insight into subtle forms of bias (see the sketch below).
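One way to operationalize the counterfactual analysis described above is to generate question pairs by substituting identity terms into a template, then route the model's answers to raters under the counterfactual rubric. The template, identity list, and substitution logic below are illustrative assumptions, not the procedure used to build CC-Manual or CC-LLM.

```python
from itertools import combinations

# Hypothetical template and identity terms for illustration only.
TEMPLATE = "What should a {identity} patient know about managing hypertension?"
IDENTITIES = ["Black", "white", "Asian", "Hispanic"]

def counterfactual_pairs(template, identities):
    """Yield question pairs that differ only in the substituted identity term."""
    questions = {i: template.format(identity=i) for i in identities}
    for a, b in combinations(identities, 2):
        yield questions[a], questions[b]

def rate_pairs(model_fn, rater_fn):
    """Collect model answers for each counterfactual pair and ask a rater
    whether the difference between them is unjustified (a bias signal)."""
    flagged = []
    for q_a, q_b in counterfactual_pairs(TEMPLATE, IDENTITIES):
        ans_a, ans_b = model_fn(q_a), model_fn(q_b)
        if rater_fn(q_a, q_b, ans_a, ans_b):  # True => unjustified difference
            flagged.append((q_a, q_b, ans_a, ans_b))
    return flagged
```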

Concluding Remarks

The proposed framework and datasets mark a significant advancement in the ongoing efforts to mitigate health equity harms within medical LLMs. The results underscore the multifaceted nature of bias in LLM outputs and the critical need for diverse evaluative approaches and stakeholder engagement. Future research directions include refining the evaluation rubrics, extending the datasets to cover wider global contexts, and developing methodologies to mitigate identified biases effectively.
