GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts (2305.12477v2)
Abstract: LLMs have exhibited remarkable performance on various NLP tasks, yet their reasoning capacity remains the subject of active debate. In this paper, we examine the performance of the GPT-3.5, GPT-4, and BARD models through a thorough technical evaluation of different reasoning tasks across eleven distinct datasets. Our paper provides empirical evidence that GPT-4 outperforms both GPT-3.5 and BARD in the zero-shot setting on almost all evaluated tasks. While the superiority of GPT-4 over GPT-3.5 might be explained by its larger size and NLP proficiency, the same explanation is not evident for BARD. We also show that all three models display limited proficiency on inductive, mathematical, and multi-hop reasoning tasks. To support these findings, we present a detailed and comprehensive analysis of the results from the three models. Furthermore, we propose a set of engineered prompts that enhance the zero-shot performance of all three models.
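The paper does not ship code, but the evaluation it describes reduces to comparing answer accuracy under different prompt templates. The Python sketch below shows one minimal way to score a plain zero-shot prompt against an engineered prompt; the `ask` wrapper, the prompt wordings, and the exact-match metric are illustrative assumptions, not the authors' actual harness.

```python
# Minimal sketch (assumption: not the paper's actual evaluation harness) of
# scoring a zero-shot prompt against an engineered prompt by exact-match
# accuracy over (question, answer) pairs.
from typing import Callable, Iterable, Tuple

def zero_shot_prompt(question: str) -> str:
    # Plain zero-shot formulation: the question with no extra guidance.
    return f"Q: {question}\nA:"

def engineered_prompt(question: str) -> str:
    # Illustrative engineered prompt in the spirit of step-by-step prompting;
    # the exact wording the paper uses is an assumption here.
    return (f"Q: {question}\n"
            "Work through the problem step by step, then give only the final answer.\nA:")

def exact_match_accuracy(
    ask: Callable[[str], str],           # model-query function (API wrapper)
    build_prompt: Callable[[str], str],  # prompt template under test
    dataset: Iterable[Tuple[str, str]],  # (question, gold answer) pairs
) -> float:
    items = list(dataset)
    hits = sum(
        ask(build_prompt(q)).strip().lower() == gold.strip().lower()
        for q, gold in items
    )
    return hits / len(items)

if __name__ == "__main__":
    # Dummy stand-in so the sketch runs end to end; in practice `ask` would
    # wrap the provider-specific API for GPT-3.5, GPT-4, or BARD.
    ask = lambda prompt: "4"
    data = [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9")]
    print("zero-shot  :", exact_match_accuracy(ask, zero_shot_prompt, data))
    print("engineered :", exact_match_accuracy(ask, engineered_prompt, data))
```

Running the same harness once per dataset, with one `ask` wrapper per model, would mirror the per-task zero-shot versus prompt-boosted comparison the abstract describes.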