LLMs May Perform MCQA by Selecting the Least Incorrect Option

Published 2 Feb 2024 in cs.CL and cs.AI | (2402.01349v3)

Abstract: In the field of NLP, LLMs have markedly enhanced performance across a variety of tasks. However, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist. Building upon previous discussions on the issue of variability, we reveal an additional dimension of concern: LLMs may perform MCQA by selecting the least incorrect option rather than a distinctly correct one. This observation suggests that LLMs might regard multiple options as correct, which could undermine the reliability of MCQA as a metric for evaluating LLMs. To address this challenge, we introduce an enhanced dataset augmentation method for MCQA, termed MCQA+, to provide a more accurate reflection of the model performance, thereby highlighting the necessity for more sophisticated evaluation mechanisms in the assessment of LLM capabilities.

Summary

The paper demonstrates that LLMs may choose the least incorrect option, casting doubt on MCQA’s reliability as a true measure of model understanding.
It reveals that LLM performance varies with reordered and altered answer options, suggesting an overfitting to traditional MCQA formats.
The introduction of MCQA+ offers a refined benchmark incorporating diverse question formats, aiming to capture more accurate assessments of LLM capabilities.

Evaluation of LLMs Through MCQA: A Critical Examination

This paper, produced by researchers at the Harbin Institute of Technology, engages in a meticulous critique of Multiple Choice Question Answering (MCQA) as a benchmark for evaluating LLMs. At the core of their research is a series of experimental analyses designed to expose the inadequacies inherent in employing MCQA as a sole metric for assessing the true capabilities of LLMs.

Examination of MCQA as a Benchmark

The paper begins by acknowledging the widespread adoption of LLMs such as GPT-3, LLaMA, and ChatGPT, and highlights the difficulty of evaluating these models accurately. Traditional metrics such as BLEU and ROUGE, while effective in certain contexts, often fail to capture the nuanced understanding required for tasks like commonsense reasoning and for the MCQA-style evaluations used in LLM benchmarks such as MMLU and BIG-bench.

The researchers note that an MCQA task usually consists of a single question with several answer options. The evaluation method assumes that the model can consistently choose the correct option, irrespective of the order in which the options are presented. However, the researchers present experimental evidence that when answer options are re-ordered, LLMs often select the correct answer inconsistently, calling into question the reliability of MCQA as a fixed benchmark.
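The re-ordering check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: `predict` is a hypothetical callable that takes a prompt string and returns an option letter, and the prompt layout in `format_prompt` is invented for the example.

```python
from itertools import permutations
from typing import Callable, Sequence

LETTERS = "ABCDEFGH"


def format_prompt(question: str, options: Sequence[str]) -> str:
    """Render a question and its options as a lettered prompt (assumed layout)."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def reorder_consistency(
    question: str,
    options: Sequence[str],
    answer_idx: int,
    predict: Callable[[str], str],  # hypothetical model wrapper
) -> float:
    """Fraction of option orderings in which `predict` picks the gold option."""
    correct = 0
    total = 0
    for perm in permutations(range(len(options))):
        shuffled = [options[i] for i in perm]
        # The gold option now sits at the position where perm holds answer_idx.
        gold_letter = LETTERS[perm.index(answer_idx)]
        pred = predict(format_prompt(question, shuffled)).strip().upper()[:1]
        correct += pred == gold_letter
        total += 1
    return correct / total
```

A predictor that always answers "A" scores 1/n on an n-option question under this check, whereas a predictor that tracks the content of the correct option scores 1.0. Enumerating every permutation is feasible for four options (24 orderings); for longer option lists, sampling a subset of orderings would be the practical choice.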

Limitations and Variability in MCQA

Through a comprehensive set of experiments on datasets such as MMLU and MedMCQA, the paper shows that LLM performance varies when the order or the number of answer choices is altered. A notable finding is that performance becomes volatile when the number of options is modified. The results suggest an apparent "overfitting" of LLMs to the traditional four-option format: accuracy varies markedly once the option count differs, which exposes a potential flaw in how MCQA assesses the logic or knowledge of LLMs.
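One way to probe sensitivity to the option count is to rebuild each question with fewer options while always keeping the correct one. The sketch below assumes distractors are dropped at random and that option texts are unique; the function name `with_option_count` is hypothetical, and the paper may construct its variants differently.

```python
import random
from typing import List, Sequence, Tuple


def with_option_count(
    options: Sequence[str], answer_idx: int, k: int, seed: int = 0
) -> Tuple[List[str], int]:
    """Return k options (gold answer plus k-1 random distractors) and the new gold index.

    Assumes option texts are unique, so the gold answer can be located by value.
    """
    if not 2 <= k <= len(options):
        raise ValueError("k must be between 2 and the number of options")
    rng = random.Random(seed)  # seeded so the variant is reproducible
    gold = options[answer_idx]
    distractors = [opt for i, opt in enumerate(options) if i != answer_idx]
    kept = rng.sample(distractors, k - 1) + [gold]
    rng.shuffle(kept)
    return kept, kept.index(gold)
```

Running the same model over the k = 2, 3, and 4 variants of each question and comparing accuracy gives a rough picture of how much the score depends on the format rather than on the underlying knowledge.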

The paper discusses how LLMs may regard several options as correct and then select the most plausible one, rather than identifying a single, exclusively correct answer. Further tests with variants such as True-or-False questions show that LLMs often falter when the question format is modified or demands more complex reasoning.
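The True-or-False probe can be illustrated by turning each option into its own yes/no item and counting how many options a model accepts. This is a minimal sketch, assuming a hypothetical `judge` callable that returns text starting with "True" or "False"; the prompt wording is invented here and is not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def tf_prompt(question: str, option: str) -> str:
    """Build a True-or-False prompt for one candidate answer (assumed wording)."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {option}\n"
        "Is the proposed answer correct? Answer True or False."
    )


def to_true_false(
    question: str, options: Sequence[str], answer_idx: int
) -> List[Tuple[str, bool]]:
    """Expand one MCQA item into (prompt, gold_label) pairs, one per option."""
    return [(tf_prompt(question, opt), i == answer_idx) for i, opt in enumerate(options)]


def accepted_options(
    question: str, options: Sequence[str], judge: Callable[[str], str]
) -> List[int]:
    """Indices of the options that `judge` (hypothetical model wrapper) marks as True."""
    return [
        i
        for i, opt in enumerate(options)
        if judge(tf_prompt(question, opt)).strip().lower().startswith("true")
    ]
```

If `accepted_options` returns more than one index for a question the model answers correctly in multiple-choice form, that matches the "least incorrect option" behaviour the paper describes: the model accepts several options as correct in isolation, yet still picks the gold one when forced to choose.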

Introduction of MCQA+ as an Improved Benchmark

To address these challenges, the authors propose an augmented dataset termed MCQA+, which aims to deliver a more nuanced evaluation. MCQA+ adds question variants, such as re-ordered, expanded, and True-or-False formatted questions, to scrutinize LLM capabilities more closely. Empirical evidence shows that performance on MCQA+ is generally lower than on the original datasets, suggesting that traditional MCQA scores may be artificially inflated by limitations in test design.
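To see why scores on an augmented set tend to fall below the original, it helps to compare per-variant accuracy with a stricter score that credits a question only when every variant is answered correctly. The aggregation below is an illustrative assumption, not the metric defined in the paper, and the variant names are placeholders.

```python
from typing import Dict, List


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-variant accuracy plus a strict score over all variants.

    `results` maps a variant name to per-question correctness flags,
    with questions in the same order for every variant.
    """
    lengths = {len(flags) for flags in results.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all variants need the same, non-zero number of questions")
    n = lengths.pop()
    summary = {name: sum(flags) / n for name, flags in results.items()}
    # A question counts under "all_variants" only if every variant was answered correctly.
    summary["all_variants"] = (
        sum(all(flags[i] for flags in results.values()) for i in range(n)) / n
    )
    return summary
```

Because the strict score can never exceed the lowest per-variant accuracy, any inconsistency across formats shows up directly as a gap between the original score and the combined one.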

Implications and Future Directions

The critique and the subsequent proposal of MCQA+ provide useful insight into the kind of performance metrics needed to evaluate LLMs meaningfully. MCQA+ reflects an effort to refine LLM evaluation methodologies so that they consistently reflect true model capabilities rather than rewarding models that are merely optimized for existing benchmarks.

In terms of practical implications, enhancing evaluation strategies fosters the development of more robust and adaptable NLP systems. By refining the benchmark metrics, future LLMs can be crafted with better understanding and reasoning capabilities that mirror human cognitive attributes more closely.

Overall, this critical examination of MCQA and the introduction of MCQA+ signify an incremental step towards refining the reliability and robustness of LLM evaluations, paving the way for more insightful and rigorous model assessments in the future. The work emphasizes the necessity for continuous examination and evolution of benchmarks, reflecting the ongoing growth and complexity of artificial intelligence.
