GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts (2305.12477v2)
Abstract: LLMs have exhibited remarkable performance on various NLP tasks, yet their reasoning capacity remains the subject of active debate. In this paper, we examine the performance of the GPT-3.5, GPT-4, and BARD models through a thorough technical evaluation of different reasoning tasks across eleven distinct datasets. Our paper provides empirical evidence that GPT-4 outperforms both GPT-3.5 and BARD in the zero-shot setting on almost all evaluated tasks. While the superiority of GPT-4 over GPT-3.5 might be explained by its larger size and NLP proficiency, the same explanation is not evident for BARD. We also show that all three models display limited proficiency on inductive, mathematical, and multi-hop reasoning tasks. To support these findings, we present a detailed and comprehensive analysis of the results from the three models. Furthermore, we propose a set of engineered prompts that enhance the zero-shot performance of all three models.
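The paper does not ship code, but the evaluation it describes reduces to comparing answer accuracy under different prompt templates. The Python sketch below shows one minimal way to score a plain zero-shot prompt against an engineered prompt; the `ask` wrapper, the prompt wordings, and the exact-match metric are illustrative assumptions, not the authors' actual harness.

```python
# Minimal sketch (assumption: not the paper's actual evaluation harness) of
# scoring a zero-shot prompt against an engineered prompt by exact-match
# accuracy over (question, answer) pairs.
from typing import Callable, Iterable, Tuple

def zero_shot_prompt(question: str) -> str:
    # Plain zero-shot formulation: the question with no extra guidance.
    return f"Q: {question}\nA:"

def engineered_prompt(question: str) -> str:
    # Illustrative engineered prompt in the spirit of step-by-step prompting;
    # the exact wording the paper uses is an assumption here.
    return (f"Q: {question}\n"
            "Work through the problem step by step, then give only the final answer.\nA:")

def exact_match_accuracy(
    ask: Callable[[str], str],           # model-query function (API wrapper)
    build_prompt: Callable[[str], str],  # prompt template under test
    dataset: Iterable[Tuple[str, str]],  # (question, gold answer) pairs
) -> float:
    items = list(dataset)
    hits = sum(
        ask(build_prompt(q)).strip().lower() == gold.strip().lower()
        for q, gold in items
    )
    return hits / len(items)

if __name__ == "__main__":
    # Dummy stand-in so the sketch runs end to end; in practice `ask` would
    # wrap the provider-specific API for GPT-3.5, GPT-4, or BARD.
    ask = lambda prompt: "4"
    data = [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9")]
    print("zero-shot  :", exact_match_accuracy(ask, zero_shot_prompt, data))
    print("engineered :", exact_match_accuracy(ask, engineered_prompt, data))
```

Running the same harness once per dataset, with one `ask` wrapper per model, would mirror the per-task zero-shot versus prompt-boosted comparison the abstract describes.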