How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments (2403.11807v4)
Abstract: Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating LLMs. Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$). All code and experimental results are publicly available via https://github.com/CUHK-ARISE/GAMABench.
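As an illustrative sketch only (not the paper's implementation), a multi-agent round of one classical scenario of the kind the abstract describes, "guess $r \times$ average," can be simulated as follows. The `level_k` players are hypothetical stand-ins for LLM agents: a level-0 player guesses at random, and a level-$k$ player best-responds under the assumption that everyone else reasons at level $k-1$.

```python
import random

def play_round(agents, ratio=2 / 3, low=0, high=100):
    """Simulate one round of the 'guess ratio * average' game.

    Each agent submits a number in [low, high]; the winner is the
    agent whose guess is closest to ratio * (mean of all guesses).
    """
    guesses = [agent() for agent in agents]
    target = ratio * sum(guesses) / len(guesses)
    winner = min(range(len(guesses)), key=lambda i: abs(guesses[i] - target))
    return guesses, target, winner

def level_k(k, ratio=2 / 3, low=0, high=100):
    """Hypothetical level-k reasoner used as a stand-in for an LLM player."""
    if k == 0:
        # Level-0: guess uniformly at random over the allowed range.
        return lambda: random.uniform(low, high)
    # Level-k: assume all others play the expected level-0 guess,
    # iterated k times toward the Nash equilibrium of 0.
    anchor = (low + high) / 2
    return lambda: anchor * ratio**k

agents = [level_k(k) for k in (0, 1, 2, 3)]
guesses, target, winner = play_round(agents)
```

A dynamic benchmark in this spirit can re-score the same game under different values of `ratio` or different player counts, which is one way a scoring scheme can adapt to varied game parameters.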