
Abstract

Decision-making is a complicated task that draws on many distinct abilities, making it an excellent setting for assessing LLMs. Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We then introduce our framework, GAMA-Bench, comprising eight classical multi-agent games, and design a scoring scheme to quantitatively assess a model's performance in these games. Through GAMA-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfactory robustness, its generalizability is relatively limited; however, its performance can be improved through approaches such as Chain-of-Thought prompting. We also conduct evaluations across various LLMs and find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of 60.5. Moreover, Gemini-1.0-Pro and GPT-3.5 (0613, 1106, 0125) demonstrate similar intelligence on GAMA-Bench. The code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.


Overview

  • The paper introduces a framework, $\gamma$-Bench, for evaluating LLMs' decision-making abilities using Game Theory across eight multi-agent games.

  • It finds that models like GPT-4 significantly outperform predecessors in decision-making, though issues with generalizability across games persist.

  • The study highlights a progression in LLM capabilities over time, with newer versions showing improved decision-making and strategic planning.

  • Future research directions include expanding the framework with more complex games and exploring strategies to enhance LLMs' generalizability and robustness.

Evaluating LLMs' Decision-Making in Multi-Agent Environments

Overview

LLMs have demonstrated remarkable capabilities across various tasks. However, evaluating these models' decision-making abilities, especially in complex scenarios involving multiple agents, remains a challenging frontier. This paper introduces a comprehensive framework for assessing LLMs in the context of Game Theory, named $\gamma$-Bench. It comprises eight classical multi-agent games, grouped into categories that probe different strategic settings such as cooperation, competition, and mixed motives. The framework not only evaluates the decision-making prowess of LLMs but also provides insights into their robustness, generalizability, and potential improvement strategies. Notably, it reveals that while models like GPT-3.5 exhibit robust decision-making, their generalizability across different games remains constrained. The study also highlights a clear improvement in decision-making ability across successive LLM versions, with GPT-4 outperforming its predecessors.

Methodological Approach

The research meticulously crafts a scoring scheme tailored for quantitatively measuring LLMs' performance in the game-theoretic context. Key elements of the methodology include:

  • Framework Design: $\gamma$-Bench incorporates eight strategically selected games, allowing for a nuanced analysis of LLMs' decision-making in scenarios involving cooperation, competition, and a blend of both.
  • Scoring Scheme: A novel scoring system is proposed to quantitatively assess LLMs' performance, focusing on the strategic soundness and effectiveness of their choices in each game (see the illustrative sketch after this list).
  • Robustness and Generalizability: The framework evaluates models' robustness in game strategy execution and their generalizability across different gaming setups.
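
To make the setup concrete, below is a minimal sketch of how one of the benchmark's games ("Guess 2/3 of the Average") could be run with multiple LLM agents and mapped onto a 0-100 score. The helper names (`ask_agent`, `llm.complete`), the prompt wording, and the normalization formula are illustrative assumptions, not the paper's implementation; GAMA-Bench defines its own per-game scoring.

```python
import statistics

def ask_agent(llm, prompt: str) -> float:
    """Query one LLM agent and parse its numeric choice (hypothetical client interface)."""
    reply = llm.complete(prompt)  # assumed method; swap in your own API call
    return float(reply.strip())

def play_guess_two_thirds(llm, n_agents: int = 10, low: int = 0, high: int = 100):
    """Run one round of 'Guess 2/3 of the Average' with n identical LLM agents."""
    prompt = (
        f"You are one of {n_agents} players. Each player picks a number in "
        f"[{low}, {high}]. The player closest to 2/3 of the average of all "
        "choices wins. Reply with a single number."
    )
    choices = [ask_agent(llm, prompt) for _ in range(n_agents)]
    target = (2 / 3) * statistics.mean(choices)
    return choices, target

def normalized_score(choices, optimal: float = 0.0, high: float = 100.0) -> float:
    """Map the agents' average distance from the equilibrium choice (0 for this game)
    onto a 0-100 scale; one plausible scheme, not the paper's exact formula."""
    avg_dist = statistics.mean(abs(c - optimal) for c in choices)
    return 100.0 * (1.0 - avg_dist / (high - optimal))
```

In a scheme like this, each of the eight games would yield its own normalized score, and the per-game scores would then be aggregated into a single benchmark number, mirroring the overall scores reported above.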

Experimental Findings

The paper presents a thorough comparative analysis of several LLMs, including different versions of GPT-3.5 and GPT-4, through $\gamma$-Bench. Some of the pivotal experimental findings are:

  • Performance Rankings: GPT-4 emerges as the top-performing model with a GAMA-Bench score of 60.5, outperforming its predecessors and showcasing notable advances in LLMs' decision-making abilities.
  • Robustness vs. Generalizability: While models like GPT-3.5 demonstrate substantial robustness in their strategic implementations, they exhibit limited generalizability across diverse game setups.
  • Version-wise Improvement: Successive versions of GPT-3.5 show progressive improvement in intelligence and decision-making capability, illustrating the rapid evolution of LLMs.

Theoretical and Practical Implications

The study makes significant contributions both theoretically and practically. Theoretically, it extends the evaluation of LLMs into the realm of Game Theory, offering a new perspective on assessing artificial intelligence. Practically, the findings shed light on the strengths and limitations of current LLMs in complex decision-making scenarios, indicating areas for further enhancement. Moreover, the improvement strategies identified, such as Chain-of-Thought prompting, suggest actionable paths for improving LLMs' performance.
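
As a concrete illustration, a Chain-of-Thought wrapper could be layered onto the same game prompts sketched earlier. The instruction wording and the `parse_final_answer` helper below are assumptions for illustration, not the paper's exact prompts.

```python
def cot_prompt(base_prompt: str) -> str:
    """Wrap a game prompt with a Chain-of-Thought instruction (illustrative wording)."""
    return (
        base_prompt
        + "\nFirst, reason step by step about what the other players are likely to choose "
        "and what the best response would be. Then give your final choice on the last "
        "line in the form 'Answer: <number>'."
    )

def parse_final_answer(reply: str) -> float:
    """Extract the number after the last 'Answer:' marker (hypothetical parser)."""
    answers = [line for line in reply.splitlines() if line.strip().startswith("Answer:")]
    return float(answers[-1].split(":", 1)[1].strip())
```

With this wrapper, the direct `ask_agent` call in the earlier sketch would become `parse_final_answer(llm.complete(cot_prompt(prompt)))`, letting the model reason about the other players before committing to a choice.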

Future Directions

Looking ahead, the paper posits several avenues for future research:

  • Expanding the Framework: Incorporating more diverse and complex games could further deepen the understanding of LLMs' decision-making capabilities.
  • Cross-model Evaluations: Comparative studies across a broader range of models could unearth more insights into the generalizable aspects of LLM intelligence.
  • Enhancement Strategies: Exploring additional strategies for improving LLMs' generalizability and robustness in strategic decision-making remains a promising research domain.

In summary, this examination of LLMs' decision-making in multi-agent environments, through the lens of Game Theory, unveils critical insights into the capabilities and limitations of current models. It not only benchmarks their performance but also paves the way for future enhancements, promising a trajectory of rapid advancement in LLM intelligence and its applicability in complex decision-making scenarios.
