GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations (2402.12348v2)
Abstract: As LLMs are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) characterize the game-theoretic reasoning of LLMs and (2) perform LLM-vs.-LLM competitions as reasoning evaluations. We observe that (1) LLMs behave differently across gaming scenarios; for example, LLMs fail in complete-information and deterministic games yet remain competitive in probabilistic scenarios; (2) most open-source LLMs, e.g., CodeLlama-34b-Instruct and Llama-2-70b-chat, are less competitive than commercial LLMs, e.g., GPT-4, in complex games, although the recently released Llama-3-70b-Instruct closes this gap. In addition, code pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We further characterize the game-theoretic properties of LLMs, such as equilibrium and Pareto efficiency in repeated games. Detailed error profiles are provided for a better understanding of LLMs' behavior. We hope our research provides standardized protocols and serves as a foundation to spur further explorations in the strategic reasoning of LLMs.
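The LLM-vs.-LLM competition protocol described above can be sketched in miniature. Below is an illustrative, hypothetical harness (not GTBench's actual API): two agents, here simple hand-written policies standing in for LLMs, alternate moves in a complete-information, deterministic game (take-1-to-3 Nim) until one wins.

```python
def optimal_agent(pile: int) -> int:
    # Winning strategy for take-1-to-3 Nim: leave the opponent
    # a multiple of 4 stones whenever possible.
    move = pile % 4
    return move if move != 0 else 1  # no winning move: take 1

def naive_agent(pile: int) -> int:
    # Weak baseline opponent: always takes a single stone.
    return 1

def play_match(agent_a, agent_b, pile: int = 10) -> str:
    """Alternate moves until the pile is empty; whoever takes the
    last stone wins. Returns 'A' or 'B'."""
    agents = {"A": agent_a, "B": agent_b}
    turn = "A"
    while pile > 0:
        take = agents[turn](pile)
        take = max(1, min(3, take, pile))  # enforce legal moves
        pile -= take
        if pile == 0:
            return turn
        turn = "B" if turn == "A" else "A"
    return turn
```

In the full benchmark setting, each agent's move-selection function would instead query an LLM with a textual game state and parse its reply; the harness's role, as here, is only to enforce legal moves and determine the outcome.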