
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes

(2312.14890)
Published Dec 22, 2023 in cs.AI, cs.CC, cs.CL, and cs.LG

Abstract

Complex reasoning ability is one of the most important features of current LLMs, and it has also been leveraged to play an integral role in complex decision-making tasks. Therefore, the investigation into the reasoning capabilities of LLMs is critical: numerous benchmarks have been established to assess the reasoning abilities of LLMs. However, current benchmarks are inadequate in offering a rigorous evaluation of the full extent of reasoning abilities that LLMs are capable of achieving. They are also prone to the risk of overfitting, as these benchmarks, being publicly accessible and static, allow models to potentially tailor their responses to specific benchmark metrics, thereby inflating their performance. Addressing these limitations, our research introduces a new benchmark, named NPHardEval. This benchmark is designed to evaluate the reasoning abilities of LLMs across a broad spectrum of 900 algorithmic questions, extending up to the NP-Hard complexity class. These questions are meticulously chosen to represent a wide range of complexity classes below the NP-hard complexity class, offering a rigorous measure of the reasoning ability of LLMs. Through this study, we shed light on the current state of reasoning in LLMs, providing an objective and rigorous perspective through the comparison of LLMs' performance across complexity classes. Moreover, this benchmark is designed with a dynamic update mechanism, where the datapoints are refreshed on a monthly basis. Such regular updates play a crucial role in mitigating the risk of LLMs overfitting to the benchmark, promoting a more accurate and reliable assessment of their reasoning capabilities. The benchmark dataset and code of NPHardEval are available at https://github.com/casmlab/NPHardEval.

Overview

  • Introduces NPHardEval, a dynamic benchmark for evaluating the reasoning abilities of LLMs with algorithmic questions extending up to the NP-Hard complexity class.

  • Benchmark is designed to prevent overfitting by updating questions monthly and focusing on logical reasoning rather than mathematics.

  • NPHardEval consists of nine tasks, each tied to a specific complexity class and divided into ten difficulty levels to accurately gauge LLM performance.

  • Initial comparisons show closed-source models outperform open-source ones, with accuracy and success rates declining as task difficulty increases.

  • Future plans include benchmark updates for relevancy and more complex evaluation frameworks to better assess LLM reasoning and learning abilities.

Introduction to the Evaluation Benchmark

In the landscape of AI, particularly in the capabilities of LLMs, reasoning ability stands as a critical attribute, especially as these models are increasingly employed in complex problem-solving domains. A new benchmark named NPHardEval has been introduced to evaluate reasoning abilities, involving 900 algorithmic questions reaching the NP-Hard complexity level. This dynamic benchmark is uniquely designed to circumvent the overfitting issues prevalent in static benchmarks by refreshing its questions on a monthly basis.

Task Design and Model Assessment

NPHardEval provides a carefully graded structure of nine tasks, each tied to a specific complexity class (P, NP-complete, or NP-hard) and subdivided into ten difficulty levels. This graded system of tasks not only captures the reasoning capacity of LLMs but also reflects the challenges encountered in real-world problem-solving across various industries. Moreover, the benchmark stands out with its automated generation and evaluation mechanisms, which amplify the reliability and accuracy of assessments. The tasks purposefully omit math-intensive problems, homing in on pure logical reasoning challenges.
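As a rough illustration of how such a graded, automatically generated suite might work, the sketch below builds Knapsack-style instances (one plausible NP-complete task) whose size scales with the difficulty level, and regenerates the set from a fresh seed to mimic a monthly refresh. The task choice, size scaling, and instance counts are illustrative assumptions, not the paper's exact generator.

```python
import random

def make_knapsack_instance(level: int, rng: random.Random) -> dict:
    """Build one Knapsack-style instance; the item count grows with the
    difficulty level (an illustrative scaling, not the paper's exact one)."""
    n_items = 4 + 2 * level  # harder levels -> more items
    items = [{"weight": rng.randint(1, 50), "value": rng.randint(1, 50)}
             for _ in range(n_items)]
    capacity = sum(it["weight"] for it in items) // 2
    return {"level": level, "items": items, "capacity": capacity}

def build_benchmark(month_seed: int, n_levels: int = 10, per_level: int = 10) -> list:
    """Regenerate the question set from a fresh seed, e.g. once per month,
    so previously released instances cannot simply be memorized."""
    rng = random.Random(month_seed)
    return [make_knapsack_instance(level, rng)
            for level in range(1, n_levels + 1)
            for _ in range(per_level)]

# Example: refresh the Knapsack portion of the benchmark for a new month.
questions = build_benchmark(month_seed=202401)
print(len(questions), questions[0]["level"], questions[-1]["level"])
```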

Insights from Initial Findings

Upon comparing several LLMs using the NPHardEval benchmark, distinct patterns emerged. Closed-source models typically showed superior reasoning performance over open-source counterparts across all complexity classes, with a conspicuous trend of diminishing accuracy and increasing failure rates as task difficulty escalated. Notably, GPT-4 consistently performed well, suggesting its robustness in approaching complex tasks.
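The trend reported here, accuracy dropping and failures rising with difficulty, can be summarized with a simple per-level aggregation. The sketch below shows that bookkeeping; the record format and the failure criterion (a response that could not be parsed into an answer) are assumptions for illustration, not the paper's evaluation code.

```python
from collections import defaultdict

def summarize_by_level(results: list[dict]) -> dict[int, dict]:
    """Aggregate per-difficulty accuracy and failure rate.
    Each record is assumed to look like:
      {"level": int, "correct": bool, "failed": bool}
    where "failed" marks responses that could not be parsed into an answer."""
    buckets = defaultdict(lambda: {"n": 0, "correct": 0, "failed": 0})
    for r in results:
        b = buckets[r["level"]]
        b["n"] += 1
        b["correct"] += int(r["correct"])
        b["failed"] += int(r["failed"])
    return {
        level: {
            "accuracy": b["correct"] / b["n"],
            "failure_rate": b["failed"] / b["n"],
        }
        for level, b in sorted(buckets.items())
    }

# Toy example: accuracy falls and failures rise at the hardest level.
toy = [{"level": 1, "correct": True, "failed": False}] * 8 + \
      [{"level": 1, "correct": False, "failed": False}] * 2 + \
      [{"level": 10, "correct": True, "failed": False}] * 3 + \
      [{"level": 10, "correct": False, "failed": True}] * 7
print(summarize_by_level(toy))
```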

In-context Learning and Future Directions

Evaluating the models' ability to generalize from provided examples revealed a disparate picture. Closed-source models exhibited the potential to genuinely learn and apply algorithmic skills, as indicated by a consistent performance across varying example difficulties. On the other hand, open-source models often struggled, particularly when the examples were simpler than the test questions. These results underline not only the raw reasoning capabilities of LLMs but also their ability—or lack thereof—to learn in a broader sense.
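One way to probe this kind of in-context generalization is to hold the test question fixed while varying the difficulty of the worked examples in the prompt. The sketch below assembles such a few-shot prompt; the prompt wording and the instance fields are hypothetical, meant only to show the shape of the experiment rather than the paper's exact prompts.

```python
def format_instance(inst: dict) -> str:
    """Render a (hypothetical) benchmark instance as prompt text."""
    return f"Capacity: {inst['capacity']}; Items (weight, value): {inst['items']}"

def build_fewshot_prompt(examples: list[dict], test_instance: dict) -> str:
    """Few-shot prompt: worked examples (question + solution) followed by the
    unsolved test question. Example difficulty can be set independently of the
    test question's difficulty to study generalization."""
    parts = ["Solve the following Knapsack decision problems."]
    for ex in examples:
        parts.append(f"Question: {format_instance(ex)}")
        parts.append(f"Answer: {ex['solution']}")
    parts.append(f"Question: {format_instance(test_instance)}")
    parts.append("Answer:")
    return "\n".join(parts)

# Example: easy (level-1) demonstrations paired with a harder test item.
easy_examples = [{"capacity": 10, "items": [(3, 4), (5, 6)], "solution": "yes"}]
hard_test = {"capacity": 120, "items": [(12, 30), (40, 55), (33, 21), (27, 48)]}
print(build_fewshot_prompt(easy_examples, hard_test))
```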

Looking ahead, NPHardEval will deploy updates to maintain relevance in the fast-evolving LLM arena. The focus will be on enhancing the evaluation framework, for example, to better represent complexity or to integrate multi-model interactions. These enhancements will pave the way for more realistic assessments of LLM capabilities, providing invaluable insights for their advancement and application in demanding cognitive tasks.
