Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Published 18 Jun 2024 in cs.CL | (2406.12809v1)

Abstract: LLMs have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g. LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent to specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.

Authors (7)

Summary

- The paper introduces the ConsisEval benchmark, which pairs an easier and a harder question in each entry across three domains (instruction-following, code, and mathematics) to assess LLM consistency.
- The paper proposes a Consistency Score: the conditional probability that a model answers the easier question of a pair correctly given that it answered the harder one correctly.
- The paper finds that harder training data improves model consistency, in both fine-tuning and in-context learning, which suggests that targeted training can enhance LLM reliability.

An Analysis of Hard-to-Easy Inconsistency in LLMs

The paper "Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?" examines an inconsistency in the behavior of LLMs: a model that can solve a difficult problem sometimes fails on a strictly easier one. The authors term this "hard-to-easy inconsistency," build the ConsisEval benchmark to evaluate it systematically, and introduce two metrics to measure it: the consistency score and the relative consistency score.

Key Contributions and Findings

ConsisEval Benchmark:
- ConsisEval is constructed specifically to evaluate the hard-to-easy consistency of LLMs. It includes data from three domains: instruction-following, code, and mathematics. Each entry in the benchmark is a pair of questions with a strict order of difficulty.
Consistency Score:
- The authors propose the Consistency Score (CS) to quantify consistency in probabilistic terms. The CS is the conditional probability that a model answers the easy question correctly, given that it has answered the corresponding harder question correctly.
Empirical Evaluation:
- The paper evaluates a range of existing LLMs on ConsisEval. GPT-4 achieves the highest CS, 92.2%, indicating relatively strong hard-to-easy consistency. Even so, individual examples show that GPT-4 is still inconsistent on specific questions, for reasons such as distraction by redundant information or misinterpretation of the question.
Correlation Between Model Capability and Consistency:
- Models with stronger capabilities typically exhibit higher consistency, but there are notable exceptions: some high-capability models still show poor consistency, which points to a complex interplay between general accuracy and consistency.
Training on Hard Data:
- Findings show that exposure to harder training data enhances the consistency of models. This improvement is noted both in fine-tuning stages and during in-context learning, suggesting that difficulty in training data plays a critical role in model consistency.
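
As a rough illustration of the Consistency Score described above, the following sketch estimates it from per-pair outcomes. It assumes one graded answer per question (a single boolean each for the easy and the hard question of a pair); the `PairResult` class and the `consistency_score` function are hypothetical names introduced here, and the paper's own estimator may differ, for example by sampling several answers per question.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PairResult:
    """Graded outcome for one benchmark entry (hypothetical structure)."""
    easy_correct: bool
    hard_correct: bool


def consistency_score(results: Iterable[PairResult]) -> Optional[float]:
    """Estimate P(easy correct | hard correct) from paired outcomes.

    Returns None when no hard question was solved, because the
    conditional probability is undefined in that case.
    """
    hard_solved = [r for r in results if r.hard_correct]
    if not hard_solved:
        return None
    both_solved = sum(r.easy_correct for r in hard_solved)
    return both_solved / len(hard_solved)


if __name__ == "__main__":
    demo = [
        PairResult(easy_correct=True, hard_correct=True),
        PairResult(easy_correct=False, hard_correct=True),  # hard-to-easy inconsistency
        PairResult(easy_correct=True, hard_correct=False),  # ignored: hard question not solved
        PairResult(easy_correct=True, hard_correct=True),
    ]
    print(consistency_score(demo))
```

Pairs in which the hard question was not solved do not enter the estimate at all, which is what separates this score from plain accuracy on the easy questions.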

Implications and Future Directions

The observed hard-to-easy inconsistency has significant implications for the trustworthiness and reliability of LLMs, especially in applications that require dependable and consistent outputs. By highlighting this under-explored aspect of model behavior, the paper opens the way for future research on improving consistency, including methods that strengthen model reasoning on easier problems without compromising performance on complex tasks.

From a theoretical standpoint, reevaluating model architectures or training paradigms might be essential to address low consistency scores. Practically, training regimes that incorporate a balanced mix of hard and easy tasks could mitigate inconsistencies, allowing for more robust AI systems.

Further research is needed to identify specific types of questions or contexts that frequently cause inconsistencies, enabling targeted improvements. Exploration of consistency could extend beyond LLMs into other AI domains where similar behavioral patterns might exist.
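
One simple starting point for such an analysis is to count, per domain, how often a model solves the hard question of a pair but misses the easy one. The sketch below assumes each graded result carries a domain label; the `GradedPair` record and the `inconsistency_rate_by_domain` function are illustrative assumptions, not part of the paper's released code.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class GradedPair(NamedTuple):
    """One graded benchmark entry with a domain label (hypothetical structure)."""
    domain: str
    easy_correct: bool
    hard_correct: bool


def inconsistency_rate_by_domain(pairs: Iterable[GradedPair]) -> Dict[str, float]:
    """Per domain, the fraction of hard-solved pairs whose easy question was missed.

    Domains with no solved hard question are omitted from the result.
    """
    hard_solved: Counter = Counter()
    easy_missed: Counter = Counter()
    for p in pairs:
        if p.hard_correct:
            hard_solved[p.domain] += 1
            if not p.easy_correct:
                easy_missed[p.domain] += 1
    return {domain: easy_missed[domain] / n for domain, n in hard_solved.items()}
```

A domain with a noticeably higher rate than the others would be a natural place to inspect individual failures, for example for redundant information in the question.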

In conclusion, the paper positions hard-to-easy consistency as a significant yet under-explored property of LLMs and lays a foundation for future work on AI reliability and trustworthiness. Current models are highly capable, but none is fully consistent: even the best-scoring model, GPT-4, still fails some easy questions after solving harder ones. The authors' stated plan to release their data and code publicly on GitHub supports collaborative research in this area.
