Open (Clinical) LLMs are Sensitive to Instruction Phrasings

(2407.09429)
Published Jul 12, 2024 in cs.CL

Abstract

Instruction-tuned LLMs can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups.

Figure: Discrepancy in AUROC scores for clinical task performance under different instruction phrasings.

Overview

  • The study investigates the sensitivity of LLMs in the clinical domain to variations in instruction phrasing, evaluating both general-purpose and clinically specialized models.

  • Using well-established clinical datasets and prompts written by medical professionals, the authors assess model performance and fairness across different demographic subgroups.

  • The findings highlight significant performance variability and demographic disparities, emphasizing the necessity for more robust and equitable LLMs in clinical settings.

Open (Clinical) LLMs are Sensitive to Instruction Phrasings

The paper by Ceballos Arroyo et al. tackles a significant issue in the use of LLMs within the clinical domain: the sensitivity of these models to variations in instruction phrasings. The authors conduct a comprehensive evaluation of the robustness of seven different LLMs, both general and domain-specific (clinical), against instructions provided by medical professionals for various clinical tasks.

Introduction and Motivation

The study examines the robustness of instruction-tuned LLMs when faced with natural variations in instruction phrasing, particularly in the context of clinical NLP tasks. The sensitivity of LLMs to how instructions are phrased is not a new discovery, but the implications are especially concerning in healthcare, where clinician-written prompts can directly affect model outputs, with potential consequences for patient outcomes.

Methodology

The authors designed an experimental setup encompassing ten clinical classification tasks and six information extraction tasks drawn from well-established datasets such as MIMIC-III and the i2b2 and n2c2 challenges. They recruited 20 medical professionals from diverse backgrounds to write prompts for each task; ultimately, instructions from 12 practitioners were used to test the robustness of the seven LLMs. The evaluated models included both general domain models (e.g., Llama 2, Alpaca) and domain-specific clinical models (e.g., Clinical Camel, Asclepius, MedAlpaca).
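
The per-prompt evaluation loop can be pictured as follows. This is a minimal sketch assuming a HuggingFace text-generation pipeline; the checkpoint name, the Alpaca-style prompt template, and the placeholder note are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: query an instruction-tuned model with several clinician-written
# phrasings of the same task. Checkpoint, template, and note are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="medalpaca/medalpaca-7b")  # assumed checkpoint

note = "Discharge summary text for a single ICU admission..."  # placeholder clinical note
phrasings = [
    "Based on this note, will the patient die during this hospital stay? Answer yes or no.",
    "Predict in-hospital mortality for the admission described below. Reply yes or no.",
    "Does this summary indicate the patient expired in hospital? Answer yes or no.",
]

for instruction in phrasings:
    # Template depends on how each model was instruction-tuned; Alpaca-style shown here.
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{note}\n\n### Response:\n"
    completion = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    answer = completion[len(prompt):].strip().lower()
    print(f"{instruction[:45]!r} -> {answer!r}")
```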

The evaluation examined each model's performance across the full range of provided prompts, summarizing results as the mean and standard deviation of task metrics across phrasings for both classification and information extraction. Additionally, the authors explored the implications of instruction phrasing for the fairness of the models' predictions across demographic subgroups (e.g., race and sex) within the clinical tasks.
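
To make this aggregation concrete, the sketch below computes AUROC per phrasing for one binary classification task and then its mean and standard deviation across phrasings; the labels and scores are synthetic placeholders rather than the paper's data.

```python
# Hedged sketch: per-phrasing AUROC for one binary clinical task, then summary
# statistics across phrasings. All data below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                        # gold labels for one task
scores_per_phrasing = [rng.random(200) for _ in range(12)]   # one score vector per clinician prompt

aucs = np.array([roc_auc_score(y_true, scores) for scores in scores_per_phrasing])
print(f"AUROC mean {aucs.mean():.3f}, std {aucs.std():.3f}, "
      f"best {aucs.max():.3f}, worst {aucs.min():.3f}")
```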

Key Findings

  1. Variability in Performance: The analysis revealed substantial differences in model performance across instruction phrasings for both classification and extraction tasks. Notably, domain-specific models trained on clinical data proved especially brittle relative to their general domain counterparts.

  2. Best vs. Worst Case Performance: General models such as Llama 2 (7b) performed better across the various prompts than their clinical analogs. For instance, on the mortality prediction task, Llama 2 (13b) outperformed Clinical Camel while exhibiting less variability across phrasings.

  3. Fairness: In the fairness evaluation, the authors found significant discrepancies in model performance between demographic subgroups. For instance, the mortality prediction task exhibited performance differences of up to 0.35 AUROC points between White and Non-White patients and up to 0.19 AUROC points between male and female patients, depending on the instruction used (a minimal version of this subgroup comparison is sketched below).
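
As a rough illustration of how such subgroup gaps can be measured, this sketch computes AUROC separately for two groups under each phrasing and reports the largest gap; the group labels, gold labels, and scores are synthetic placeholders.

```python
# Hedged sketch: per-phrasing AUROC gap between two demographic subgroups.
# Labels, scores, and group membership are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=400)       # gold labels (e.g., in-hospital mortality)
group = rng.integers(0, 2, size=400)        # 0 / 1 encode two subgroups (illustrative)

gaps = []
for _ in range(12):                          # one model score vector per phrasing
    scores = rng.random(400)
    auc_a = roc_auc_score(y_true[group == 0], scores[group == 0])
    auc_b = roc_auc_score(y_true[group == 1], scores[group == 1])
    gaps.append(abs(auc_a - auc_b))

print(f"largest subgroup AUROC gap across phrasings: {max(gaps):.3f}")
```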

Implications and Future Directions

The findings from this study have profound practical and theoretical implications. Practically, the brittleness of clinical LLMs to instruction variations suggests that deploying these models in real-world settings could lead to inconsistent outcomes, influencing patient care based on arbitrary prompt differences. Theoretically, the work underscores a critical need for developing more robust LLMs capable of maintaining stable performance across varying natural language instructions.

From a fairness perspective, the study highlights the need to address demographic biases in current LLMs. Given the observed performance differences across race and sex, more research is needed to ensure equitable outcomes in clinical settings, where bias can have far-reaching consequences.

Conclusion

The research presented by Ceballos Arroyo et al. offers a rigorous analysis of instruction-sensitive LLMs in clinical environments. The key takeaway is the evident lack of robustness in current models to instruction phrasings, which raises concerns for their applicability in high-stakes domains such as healthcare. Future efforts should be directed towards improving the stability and fairness of LLMs to ensure consistent and reliable performance irrespective of instruction variability. The findings point to an urgent need for developing advanced methodologies to enhance the resilience of LLMs, ultimately promoting safer and more reliable AI systems in clinical practice.
