CofCA: A Step-Wise Counterfactual Multi-hop QA Benchmark

Published 19 Feb 2024 in cs.CL | (2402.11924v5)

Abstract: While LLMs excel in question-answering (QA) tasks, their real reasoning abilities in retrieving and integrating multiple pieces of evidence on multi-hop QA tasks remain less explored. First, LLMs sometimes generate answers that rely on internal memory rather than on retrieving evidence and reasoning over the given context, which raises concerns about how well evaluations capture real reasoning abilities. Although previous counterfactual QA benchmarks can separate out the internal memory of LLMs, they focus solely on final QA performance, which is insufficient for reporting LLMs' real reasoning abilities, because LLMs are expected to engage in intricate reasoning processes that involve evidence retrieval and answering a series of sub-questions from given passages. Moreover, current factual Multi-hop QA (MHQA) benchmarks are annotated on open-source corpora such as Wikipedia; although useful for multi-step reasoning evaluation, they are limited by potential data contamination in LLMs' pre-training stage. To address these issues, we introduce a Step-wise Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual data and counterfactual data that reveals LLMs' real reasoning abilities on multi-step reasoning and reasoning chain evaluation. Our experimental results reveal a significant performance gap for several LLMs between Wikipedia-based factual data and counterfactual data, indicating data contamination issues in existing benchmarks. Moreover, we observe that LLMs usually bypass the correct reasoning chain, showing an inflated multi-step reasoning performance. We believe that our CofCA benchmark will enhance and facilitate the evaluations of trustworthy LLMs.

Summary

The paper reveals a significant performance gap, showing how LLMs score lower on the counterfactually edited dataset compared to traditional benchmarks.
The paper demonstrates that LLMs often follow flawed reasoning chains: GPT-4, for example, reaches only 36.3% accuracy when evaluated on the correct reasoning chain.
The paper proposes a joint evaluation metric that combines intermediate and final answer assessments to better capture multi-hop reasoning challenges.

Evaluation of LLMs Through Multi-Hop Reasoning and Knowledge Editing

The paper "MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition" addresses an essential dimension of evaluating LLMs: their multi-hop reasoning ability in question answering tasks. Despite the pronounced capabilities of LLMs in multi-hop question answering (MHQA) scenarios, the authors argue that these models' genuine reasoning abilities remain inadequately explored due to conventional evaluation limitations. The authors introduce a novel benchmark, MRKE, intended to overcome current limitations by editing the established HotpotQA dataset and integrating evaluation of reasoning chains.

Key Contributions and Findings

The primary contributions of this research include the development of a new MHQA benchmark by editing HotpotQA to incorporate new and unprecedented knowledge through knowledge editing. This approach intends to mitigate challenges such as data contamination, which arises when evaluation datasets may have been exposed to the models during pretraining, thereby potentially inflating model performance metrics. Additionally, this work emphasizes the assessment of reasoning chains via sub-questions and their corresponding intermediate answers.
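To make the idea of a knowledge edit concrete, here is a minimal sketch of a counterfactual edit applied to one multi-hop example. It is an illustration only, not the authors' editing pipeline: the `QAExample` type, the `edit_entity` helper, the sample passages, and the made-up country name "Norland" are all hypothetical. The point is that the entity is swapped consistently in the passages, the question, and the gold answer, so a model that answers from memory rather than from the given context gets the edited example wrong.

```python
import re
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class QAExample:
    """Hypothetical container for one multi-hop QA item."""
    passages: Tuple[str, ...]
    question: str
    answer: str


def edit_entity(example: QAExample, old: str, new: str) -> QAExample:
    """Return a copy of `example` with every whole-word `old` replaced by `new`.

    Assumption: a plain string substitution stands in for knowledge editing.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")

    def swap(text: str) -> str:
        # A function replacement avoids backslash escapes being read from `new`.
        return pattern.sub(lambda _match: new, text)

    return replace(
        example,
        passages=tuple(swap(p) for p in example.passages),
        question=swap(example.question),
        answer=swap(example.answer),
    )


original = QAExample(
    passages=(
        "Ada Lovelace was born in London.",
        "London is the capital of England.",
    ),
    question="Ada Lovelace was born in the capital of which country?",
    answer="England",
)

# Counterfactual version: the context now supports "Norland", not "England".
edited = edit_entity(original, "England", "Norland")
print(edited.passages[1])
print(edited.answer)
```

A model that still answers "England" on the edited example is drawing on memorized knowledge instead of the supplied passages, which is the behaviour such an edit is meant to expose.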

Performance Gap: The study finds a notable performance disparity when models are evaluated on the edited dataset rather than on the original HotpotQA. For instance, GPT-4 scores 53.2 EM (exact match) and 67.7 F1 on two-hop questions in the MRKE dataset, against 69.3 EM and 82.2 F1 on the original HotpotQA dataset, a drop of 16.1 EM and 14.5 F1 points. This gap points to potential data contamination in traditional MHQA datasets and suggests that LLMs' reasoning abilities are being overstated.
Reasoning Chain Evaluation: The paper introduces reasoning chain evaluation to assess whether LLMs follow the correct reasoning process to arrive at their answers. For example, GPT-4 only manages a 36.3% accuracy on the correct reasoning chain across the dataset. This finding implies that while LLMs may arrive at correct answers, they do not consistently follow the accurate reasoning path, indicating a reliance on memory or heuristic shortcuts.
Joint Evaluation Metric: To better capture the interplay of question complexity and reasoning capabilities, the paper proposes a new evaluation metric that combines intermediate and final answer assessment. The results reveal that models experience diminished performance with increasing multi-hop complexity, highlighting the need for more robust reasoning strategies in LLMs.
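The three findings above rest on answer-level scores (EM and F1) and on checks over intermediate answers. The sketch below shows one way these could be computed. It assumes the SQuAD-style normalization commonly used with HotpotQA, and it assumes a strict joint criterion in which an item counts only when the final answer and every sub-answer match exactly. The summary does not give the paper's exact joint formula, so `chain_is_correct` and `joint_correct` are hypothetical helpers, and the sample answers are made up.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def chain_is_correct(sub_predictions, sub_golds) -> bool:
    """Assumed criterion: every intermediate answer matches exactly."""
    return len(sub_predictions) == len(sub_golds) and all(
        exact_match(p, g) == 1.0 for p, g in zip(sub_predictions, sub_golds)
    )


def joint_correct(final_pred, final_gold, sub_predictions, sub_golds) -> bool:
    """Assumed joint criterion: correct final answer AND correct chain."""
    return exact_match(final_pred, final_gold) == 1.0 and chain_is_correct(
        sub_predictions, sub_golds
    )


# Made-up item: the final answer is right, but one intermediate hop is wrong.
final_ok = exact_match("The Blue River", "blue river.")
joint_ok = joint_correct(
    "The Blue River", "Blue River",
    sub_predictions=["Mira Voss", "1987"],
    sub_golds=["Mira Voss", "1978"],
)
print(final_ok, joint_ok)
```

Under this strict criterion, an item with a correct final answer but a wrong intermediate hop scores on EM yet fails the joint check, which is the kind of inflated performance the reasoning chain evaluation is designed to surface.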

Implications

The MRKE benchmark represents a significant step toward more accurately evaluating LLMs' reasoning abilities in MHQA tasks. By isolating reasoning capabilities from memorization, the study provides a new lens to understand and improve LLM performance. The findings suggest a critical need for developing techniques and models that enhance reasoning pathways rather than merely improving final answer generation.

Future Directions

Given the highlighted discrepancies in reasoning abilities, future work should focus on developing LLM architectures and training regimes that emphasize reasoning chains and procedural correctness. Additionally, extending the scope of MRKE and similar benchmarks to other LLMs and domains could further elucidate the reasoning dynamics at play. Investigating methods to dynamically update benchmarks in response to LLMs' evolving training datasets could also mitigate the risk of data contamination.

In summary, the paper offers a compelling methodology for evaluating the reasoning capabilities of LLMs using a refined multi-hop QA approach, presenting important insights into both current limitations and future prospects in LLM development.
