Large Language Models Are Not Robust Multiple Choice Selectors

Published 7 Sep 2023 in cs.CL | (2309.03882v4)

Abstract: Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of LLMs. This work shows that modern LLMs are vulnerable to option position changes in MCQs due to their inherent "selection bias", namely, they prefer to select specific option IDs as answers (like "Option A"). Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs' token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. To mitigate selection bias, we propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution. PriDe first estimates the prior by permutating option contents on a small number of test samples, and then applies the estimated prior to debias the remaining samples. We demonstrate that it achieves interpretable and transferable debiasing with high computational efficiency. We hope this work can draw broader research attention to the bias and robustness of modern LLMs.

Summary

The paper demonstrates that LLMs show inherent selection bias in multiple choice tasks due to token bias rather than position bias.
It systematically evaluates 20 models on MMLU, ARC-Challenge, and CommonsenseQA, revealing vulnerabilities consistent across domains.
It introduces PriDe, an efficient debiasing strategy that neutralizes token bias using prior estimation without requiring labeled data.

Analysis of "Large Language Models Are Not Robust Multiple Choice Selectors"

LLMs are frequently evaluated with multiple choice questions (MCQs), yet their answers can change when the options are merely reordered. The paper "Large Language Models Are Not Robust Multiple Choice Selectors" analyzes this phenomenon and argues that LLMs exhibit selection bias: a predisposition toward specific option identifiers.

Key Findings and Methodology

The study identifies a significant selection bias in LLMs, which makes their answers sensitive to the order in which MCQ options are presented. Models tend to favor particular option IDs (such as "Option A"), and the authors attribute this primarily to token bias rather than position bias. Token bias means the model a priori assigns more probability mass to certain option ID tokens (e.g., A/B/C/D). Position bias would instead mean a preference for options based on their ordinal placement, which the study finds to be the less prevalent factor.
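The sensitivity to option order described above can be probed by moving option contents across a fixed set of option IDs and checking whether the model keeps choosing the same content. The following is a minimal sketch of that setup. The prompt layout and the helper names (`cyclic_permutations`, `format_question`, `gold_id`) are assumptions made for illustration and are not taken from the paper. Note that cyclic shifts alone do not separate token bias from position bias, because an option's ID and its position move together.

```python
from typing import List

OPTION_IDS = ["A", "B", "C", "D"]  # assumed ID symbols for a 4-option MCQ


def cyclic_permutations(contents: List[str]) -> List[List[str]]:
    """Return every cyclic shift of the option contents.

    Across all shifts, each content appears under each option ID exactly once.
    """
    n = len(contents)
    return [contents[shift:] + contents[:shift] for shift in range(n)]


def format_question(question: str, contents: List[str]) -> str:
    """Render a prompt with fixed option IDs and the given content order.

    The layout is a hypothetical prompt format, not the paper's template.
    """
    lines = [question]
    for option_id, text in zip(OPTION_IDS, contents):
        lines.append(f"{option_id}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


def gold_id(contents: List[str], correct_text: str) -> str:
    """Return the option ID that holds the correct content in this ordering."""
    return OPTION_IDS[contents.index(correct_text)]
```

A robust selector would pick the ID returned by `gold_id` for every shift. A model with selection bias would drift toward its preferred ID regardless of where the correct content sits.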

Twenty LLMs from several model families were evaluated on three MCQ benchmarks: MMLU, ARC-Challenge, and CommonsenseQA. The bias appeared consistently across domains, which suggests that it is an intrinsic model behavior rather than an artifact of a particular dataset.
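One plausible way to quantify such a bias on a labeled benchmark is to compare recall across option IDs: a balanced model answers correctly about as often whichever ID holds the correct answer, while a biased model has uneven recalls. The sketch below computes per-option recall and its standard deviation. This summary does not state the metric used in the paper, so treat the function names and the choice of spread measure as assumptions.

```python
from statistics import pstdev
from typing import Dict, List


def recall_per_option(gold: List[str], pred: List[str],
                      option_ids: List[str]) -> Dict[str, float]:
    """Recall of each option ID: of the questions whose gold answer is that ID,
    the fraction the model answered with that ID."""
    recalls = {}
    for option_id in option_ids:
        total = sum(1 for g in gold if g == option_id)
        hits = sum(1 for g, p in zip(gold, pred)
                   if g == option_id and p == option_id)
        # Assumption of this sketch: an ID that never appears as gold gets recall 0.
        recalls[option_id] = hits / total if total else 0.0
    return recalls


def recall_spread(gold: List[str], pred: List[str],
                  option_ids: List[str]) -> float:
    """Population standard deviation of per-option recalls (0.0 = balanced)."""
    recalls = recall_per_option(gold, pred, option_ids)
    return pstdev(list(recalls.values()))
```

For example, a model that over-selects "A" tends to show high recall on "A" and lower recall on the other IDs, which raises the spread.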

PriDe - A Debiasing Strategy

To address this bias, the authors propose PriDe (Debiasing with Prior estimation), a label-free method that separates the model's prior bias for option IDs from its overall prediction distribution at inference time. PriDe first estimates the prior by permuting option contents on a small number of test samples, then applies the estimated prior to debias the remaining samples. According to the paper, this achieves debiasing comparable to full permutation-based debiasing at a significantly lower computational cost.
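The two stages described above (estimate a prior on a few samples, then reuse it on the rest) can be sketched as follows. This is an illustrative sketch, not the authors' implementation. It assumes that the observed probability of an option ID factorizes as a prior over IDs times a bias-free probability of the content under that ID, that cyclic shifts are used so every content visits every ID once, and that all probabilities are strictly positive. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def estimate_sample_prior(probs_per_shift: Sequence[Sequence[float]]) -> List[float]:
    """Estimate the ID prior for one question.

    probs_per_shift[k][i] is the model's probability of option ID i when the
    contents are presented under the k-th cyclic shift. Averaging the log
    probabilities over all shifts cancels the content term under the assumed
    factorization, leaving the prior up to normalization.
    """
    num_ids = len(probs_per_shift[0])
    num_shifts = len(probs_per_shift)
    mean_log = [
        sum(math.log(probs[i]) for probs in probs_per_shift) / num_shifts
        for i in range(num_ids)
    ]
    return normalize([math.exp(value) for value in mean_log])


def estimate_global_prior(sample_priors: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-question priors from the small estimation subset."""
    num_ids = len(sample_priors[0])
    return [
        sum(prior[i] for prior in sample_priors) / len(sample_priors)
        for i in range(num_ids)
    ]


def debias(observed: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Divide out the prior from a single unpermuted prediction and renormalize."""
    return normalize([o / p for o, p in zip(observed, prior)])
```

In this sketch only the estimation subset needs one forward pass per shift. Every remaining question needs a single pass followed by `debias`, which is where the computational saving over full permutation-based debiasing would come from.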

Notably, the paper reports that PriDe's debiasing is interpretable and effective across model families. The estimated priors also transfer across domains, which adds to the method's practical value for researchers and practitioners who need more stable LLM answer selection in automated evaluation settings.

Implications and Future Directions

These findings underscore the need for better strategies to improve LLM robustness, particularly in automated testing and evaluation. PriDe both diagnoses an inherent bias and offers a pragmatic, computationally efficient way to reduce it. As LLMs are deployed across more applications, ensuring their robustness remains a priority.

Further exploration of the underlying causes of selection bias, along with refinement of debiasing techniques, will be essential. Applying PriDe to additional models or adapting it to domain-specific settings is a promising avenue for improving LLM reliability. Researchers may also consider these findings when designing models with bias mitigation built in, so that LLMs are not only powerful but also impartial in selection tasks.

In conclusion, this paper offers a clear framework for understanding and addressing selection bias in LLMs when they answer multiple choice questions. By both identifying the bias and proposing a practical remedy, the authors set the stage for subsequent work in LLM research and application.
