Open (Clinical) LLMs are Sensitive to Instruction Phrasings

(2407.09429)
Published Jul 12, 2024 in cs.CL

Abstract

Instruction-tuned LLMs can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups.

Figure: Discrepancy in AUROC scores for clinical task performance under different instruction phrasings.

Overview

  • The study investigates the sensitivity of LLMs in the clinical domain to variations in instruction phrasing, evaluating both general-purpose and clinically specialized models.

  • Using well-established clinical datasets and prompts written by medical professionals, the authors assess model performance and fairness across different demographic subgroups.

  • The findings highlight significant performance variability and demographic disparities, emphasizing the necessity for more robust and equitable LLMs in clinical settings.

Open (Clinical) LLMs are Sensitive to Instruction Phrasings

The paper by Ceballos Arroyo et al. tackles a significant issue in the use of LLMs within the clinical domain: the sensitivity of these models to variations in instruction phrasings. The authors conduct a comprehensive evaluation of the robustness of seven different LLMs, both general and domain-specific (clinical), against instructions provided by medical professionals for various clinical tasks.

Introduction and Motivation

The study examines the robustness of instruction-tuned LLMs when faced with natural variations in instruction phrasing, particularly in the context of clinical NLP tasks. The sensitivity of LLMs to how instructions are phrased is not a new discovery, but the implications are especially concerning in healthcare, where clinician-written prompts can directly affect model outputs, with potential consequences for patient outcomes.

Methodology

The authors designed an experimental setup encompassing ten clinical classification tasks and six information extraction tasks drawn from well-established datasets such as MIMIC-III and the i2b2 and n2c2 challenges. They recruited 20 medical professionals from diverse backgrounds to write prompts for each task; ultimately, instructions from 12 practitioners were used to test the robustness of the seven LLMs. The evaluated models included both general domain models (e.g., Llama 2, Alpaca) and domain-specific clinical models (e.g., Clinical Camel, Asclepius, MedAlpaca).
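
The per-prompt evaluation loop can be pictured as follows. This is a minimal sketch assuming a HuggingFace text-generation pipeline; the checkpoint name, the Alpaca-style prompt template, and the placeholder note are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: query an instruction-tuned model with several clinician-written
# phrasings of the same task. Checkpoint, template, and note are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="medalpaca/medalpaca-7b")  # assumed checkpoint

note = "Discharge summary text for a single ICU admission..."  # placeholder clinical note
phrasings = [
    "Based on this note, will the patient die during this hospital stay? Answer yes or no.",
    "Predict in-hospital mortality for the admission described below. Reply yes or no.",
    "Does this summary indicate the patient expired in hospital? Answer yes or no.",
]

for instruction in phrasings:
    # Template depends on how each model was instruction-tuned; Alpaca-style shown here.
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{note}\n\n### Response:\n"
    completion = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    answer = completion[len(prompt):].strip().lower()
    print(f"{instruction[:45]!r} -> {answer!r}")
```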

The evaluation examined each model's performance across the full range of provided prompts, summarizing results as the mean and standard deviation of task metrics across phrasings for both classification and information extraction. Additionally, the authors explored the implications of instruction phrasing for the fairness of the models' predictions across demographic subgroups (e.g., race and sex) within the clinical tasks.
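
To make this aggregation concrete, the sketch below computes AUROC per phrasing for one binary classification task and then its mean and standard deviation across phrasings; the labels and scores are synthetic placeholders rather than the paper's data.

```python
# Hedged sketch: per-phrasing AUROC for one binary clinical task, then summary
# statistics across phrasings. All data below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                        # gold labels for one task
scores_per_phrasing = [rng.random(200) for _ in range(12)]   # one score vector per clinician prompt

aucs = np.array([roc_auc_score(y_true, scores) for scores in scores_per_phrasing])
print(f"AUROC mean {aucs.mean():.3f}, std {aucs.std():.3f}, "
      f"best {aucs.max():.3f}, worst {aucs.min():.3f}")
```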

Key Findings

  1. Variability in Performance: The analysis revealed substantial differences in model performance across instruction phrasings for both classification and extraction tasks. Notably, domain-specific models trained on clinical data proved especially brittle relative to their general domain counterparts.

  2. Best vs. Worst Case Performance: General models such as Llama 2 (7b) performed better across the various prompts than their clinical analogs. For instance, on the mortality prediction task, Llama 2 (13b) outperformed Clinical Camel while exhibiting less variability across phrasings.

  3. Fairness: In the fairness evaluation, the authors found significant discrepancies in model performance between demographic subgroups. For instance, the mortality prediction task exhibited performance differences of up to 0.35 AUROC points between White and Non-White patients and up to 0.19 AUROC points between male and female patients, depending on the instruction used (a minimal version of this subgroup comparison is sketched below).
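
As a rough illustration of how such subgroup gaps can be measured, this sketch computes AUROC separately for two groups under each phrasing and reports the largest gap; the group labels, gold labels, and scores are synthetic placeholders.

```python
# Hedged sketch: per-phrasing AUROC gap between two demographic subgroups.
# Labels, scores, and group membership are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=400)       # gold labels (e.g., in-hospital mortality)
group = rng.integers(0, 2, size=400)        # 0 / 1 encode two subgroups (illustrative)

gaps = []
for _ in range(12):                          # one model score vector per phrasing
    scores = rng.random(400)
    auc_a = roc_auc_score(y_true[group == 0], scores[group == 0])
    auc_b = roc_auc_score(y_true[group == 1], scores[group == 1])
    gaps.append(abs(auc_a - auc_b))

print(f"largest subgroup AUROC gap across phrasings: {max(gaps):.3f}")
```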

Implications and Future Directions

The findings from this study have profound practical and theoretical implications. Practically, the brittleness of clinical LLMs to instruction variations suggests that deploying these models in real-world settings could lead to inconsistent outcomes, influencing patient care based on arbitrary prompt differences. Theoretically, the work underscores a critical need for developing more robust LLMs capable of maintaining stable performance across varying natural language instructions.

From a fairness perspective, the study highlights the need to address demographic biases in current LLMs. Given the observed performance differences across race and sex, more research is needed to ensure equitable outcomes in clinical settings, where bias can have far-reaching consequences.

Conclusion

The research presented by Ceballos Arroyo et al. offers a rigorous analysis of instruction-sensitive LLMs in clinical environments. The key takeaway is the evident lack of robustness in current models to instruction phrasings, which raises concerns for their applicability in high-stakes domains such as healthcare. Future efforts should be directed towards improving the stability and fairness of LLMs to ensure consistent and reliable performance irrespective of instruction variability. The findings point to an urgent need for developing advanced methodologies to enhance the resilience of LLMs, ultimately promoting safer and more reliable AI systems in clinical practice.
