
Abstract

LLMs have demonstrated the potential to mimic human social intelligence. However, most studies focus on simplistic and static self-report or performance-based tests, which limits the depth and validity of the analysis. In this paper, we developed a novel framework, InterIntent, to assess LLMs' social intelligence by mapping their ability to understand and manage intentions in a game setting. We focus on four dimensions of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind. Each dimension is linked to a specific game task: intention selection, intention following, intention summarization, and intention guessing. Our findings indicate that while LLMs exhibit high proficiency in selecting intentions, achieving an accuracy of 88%, their ability to infer the intentions of others is significantly weaker, trailing human performance by 20%. Additionally, game performance correlates with intention understanding, highlighting the importance of the four components towards success in this game. These findings underline the crucial role of intention understanding in evaluating LLMs' social intelligence and highlight the potential of using social deduction games as a complex testbed to enhance LLM evaluation. InterIntent contributes a structured approach to bridging the evaluation gap in social intelligence within multiplayer games.

Four dimensions for assessing social intelligence in Avalon via dynamic gaming contexts.

Overview

  • The paper by Liu et al. introduces a novel framework to evaluate the social intelligence of LLMs using the multiplayer game Avalon, focusing on dynamic, interactive assessments rather than traditional, static methods.

  • The framework evaluates LLMs' capabilities across four components of social intelligence: situational awareness (intention selection), self-regulation (intention following), self-awareness (intention summarization), and theory of mind (intention guessing).

  • Results indicate that GPT-3.5 and GPT-4 perform well in recognizing and selecting intentions but struggle to predict others' intentions, highlighting the gap between current LLMs' social intelligence and human performance.

Investigating Social Intelligence of LLMs: An Intention Understanding Approach

The paper, "InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context" by Liu et al., addresses the social intelligence of LLMs by presenting a novel evaluation framework within the context of a multiplayer game called Avalon. This research breaks new ground by shifting the focus of LLM social intelligence assessment from traditional, static methods to a dynamic, interactive game context.

Abstract

The authors introduce a new framework designed to assess LLMs' social intelligence by evaluating their capability to understand and manage intentions in a gaming environment. Specifically, they concentrate on four key components of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind (ToM). Within Avalon, these components are mapped to intention selection, intention following, intention summarization, and intention guessing, respectively. The findings show that the evaluated LLMs, GPT-3.5 and GPT-4, perform well in recognizing and selecting intentions but fall short of human performance in accurately predicting others' intentions.

Introduction

The introduction establishes the problem of evaluating LLMs' social intelligence, highlighting that existing methods are predominantly simplistic and static, focusing on performance-based tests that lack depth and validity. The authors propose using social deduction games to create a more complex and dynamic evaluation framework. Avalon, a well-established social deduction game that requires players to engage in strategic conversation while concealing and inferring intentions, serves as the testbed for the proposed evaluations.

Methodology

The methodology section details the framework and processes used to implement the evaluation. The authors present a structured, intention-guided gameplay mechanism within Avalon, designed to dynamically generate contexts for assessing the LLMs' social intelligence.
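
Concretely, the four tasks can be pictured as hooks in a single game loop. The sketch below is a hypothetical harness rather than the authors' code; `call_llm` and the candidate intentions stand in for a real chat-model API and the game's intention pool.

```python
# Minimal sketch of an intention-guided game loop (illustrative, not the paper's code).

def call_llm(prompt: str) -> str:
    """Stub so the sketch runs end to end; replace with a real model call."""
    return "0"

CANDIDATE_INTENTIONS = [
    "Build trust by revealing partial, verifiable information",
    "Deflect suspicion toward another player",
    "Probe a specific player's allegiance with a pointed question",
]

def play_turn(game_context: str) -> dict:
    menu = "\n".join(f"{i}: {c}" for i, c in enumerate(CANDIDATE_INTENTIONS))

    # 1) Intention selection (situational awareness): pick a contextually apt intention.
    idx = int(call_llm(
        f"Context:\n{game_context}\nCandidate intentions:\n{menu}\nReply with one index."
    ))
    intention = CANDIDATE_INTENTIONS[idx]

    # 2) Intention following (self-regulation): private thinking, then public speech.
    thinking = call_llm(f"Plan how to pursue '{intention}' given:\n{game_context}")
    speech = call_llm(f"Address the table, following this plan:\n{thinking}")

    # 3) Intention summarization (self-awareness): recover one's own intention.
    summary = call_llm(
        f"From your plan and speech, state the intention you pursued:\n{thinking}\n{speech}"
    )

    # 4) Intention guessing (theory of mind): others infer it from the speech alone.
    guess = call_llm(f"A player said:\n{speech}\nWhich candidate intention were they pursuing?\n{menu}")

    return {"intention": intention, "speech": speech, "summary": summary, "guess": guess}

if __name__ == "__main__":
    print(play_turn("Round 2: the first quest failed and players debate the next team."))
```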

Intention Selection

Situational awareness is measured by evaluating the LLMs’ ability to select intentions that are contextually appropriate based on game interactions. This involves assessing the reasonableness of selected intentions against the game's facts, role profiles, and other intentions.
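
In code, this reduces to a binary judgment per selected intention. A minimal sketch, assuming an LLM-as-judge setup (the `judge` stub and record format are illustrative, not the paper's exact rubric):

```python
def judge(prompt: str) -> str:
    """Stub judge; replace with a strong model acting as evaluator."""
    return "1"

def selection_is_reasonable(intention: str, facts: str, role_profile: str) -> bool:
    verdict = judge(
        "Is the intention reasonable given the game facts and the player's role?\n"
        f"Facts: {facts}\nRole: {role_profile}\nIntention: {intention}\n"
        "Answer 1 (reasonable) or 0 (not)."
    )
    return verdict.strip() == "1"

# Accuracy over many logged selections yields the kind of figure reported (~88%).
records = [("Deflect suspicion toward another player", "Quest 1 failed", "Assassin (evil)")]
accuracy = sum(selection_is_reasonable(*r) for r in records) / len(records)
print(f"selection accuracy: {accuracy:.2f}")
```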

Intention Following

Self-regulation is evaluated by examining how well the LLMs adhere to their selected intentions. This covers both a thinking (planning) phase and a speaking (implementing) phase, with the model's responses graded on a 1-5 Likert scale for adherence to the selected intention.
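
A minimal sketch of how such grading could be automated, assuming an LLM-as-judge grader (the `judge` stub and prompt wording are illustrative):

```python
import re

def judge(prompt: str) -> str:
    """Stub judge; replace with a capable grader model."""
    return "4"

def grade_adherence(intention: str, text: str, phase: str) -> int:
    raw = judge(
        f"On a 1-5 Likert scale, how faithfully does this {phase} follow the "
        f"intention '{intention}'? Reply with a single digit.\nText: {text}"
    )
    match = re.search(r"[1-5]", raw)
    return int(match.group()) if match else 1  # lowest grade on parse failure

print(grade_adherence("Deflect suspicion", "I will redirect blame to Player 3.", "thinking"))
print(grade_adherence("Deflect suspicion", "Player 3 was oddly quiet last quest...", "speech"))
```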

Intention Summarization

Self-awareness is assessed by requiring the LLMs to summarize their intentions based on their internal thought processes and speeches. This evaluates a model's capability to introspect and accurately articulate its own intentions.
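
One simple proxy for scoring such summaries is lexical overlap with the ground-truth intention; the token-level F1 below is an illustrative stand-in, not necessarily the paper's scoring method:

```python
def f1_overlap(summary: str, gold: str) -> float:
    """Token-overlap F1 between a self-reported summary and the gold intention."""
    s, g = set(summary.lower().split()), set(gold.lower().split())
    overlap = len(s & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(s), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(f1_overlap("I tried to deflect suspicion onto Player 3",
                 "Deflect suspicion toward another player"))
```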

Intention Guessing

ToM is evaluated by having the LLMs guess others' intentions based on their speeches alone. This task challenges the models to understand and predict the mental states and intentions of other players, reflecting a more rigorous and realistic test of social intelligence.
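
Scoring this task requires mapping a free-text guess back to a candidate intention. The sketch below uses token overlap as an illustrative matching rule; the paper's actual procedure may differ:

```python
CANDIDATES = [
    "Build trust by revealing partial information",
    "Deflect suspicion toward another player",
    "Probe a specific player's allegiance",
]

def nearest_candidate(guess: str) -> str:
    """Map a free-text guess to the candidate intention sharing the most tokens."""
    g = set(guess.lower().split())
    return max(CANDIDATES, key=lambda c: len(g & set(c.lower().split())))

def guessing_accuracy(pairs: list[tuple[str, str]]) -> float:
    # pairs: (free-text guess, speaker's ground-truth intention)
    hits = sum(nearest_candidate(guess) == truth for guess, truth in pairs)
    return hits / len(pairs)

print(guessing_accuracy([
    ("They want to deflect suspicion toward another player", CANDIDATES[1]),
]))
```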

Results and Discussion

The results demonstrate that GPT-3.5 and GPT-4 can effectively understand and select appropriate intentions, achieving around 88% accuracy. However, their performance in intention guessing remains significantly behind human levels, with GPT-4 performing better than GPT-3.5 but still exhibiting notable deficiencies in ToM capabilities.

Intention Selection and Following

LLMs show good situational comprehension, though translating selected intentions into coherent and contextually appropriate speech remains challenging. The correlation between intention understanding and game performance indicates that better selection of, and adherence to, intentions can positively affect game outcomes, particularly for loyal players.
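
This kind of claim rests on a simple statistic: the correlation between per-game intention scores and game outcomes. A sketch with invented numbers (the paper's statistics come from real game logs):

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; with a binary outcome this is the point-biserial r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

following_scores = [4.2, 3.1, 4.8, 2.5, 3.9]  # mean Likert adherence per game (made up)
won_game = [1, 0, 1, 0, 1]                    # 1 = loyal side won (made up)
print(f"r = {pearson(following_scores, won_game):.2f}")
```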

Intention Summarization and Guessing

The study finds that while self-awareness in LLMs can reach human levels, as evidenced by intention summarization tasks, their ToM abilities, especially under complex and dynamic conditions, are significantly weaker. This gap underscores the challenge in developing LLMs that can accurately interpret and predict human-like mental states and intentions.

Implications and Future Directions

The paper's implications are twofold: practical and theoretical. Practically, the proposed framework offers a robust method for evaluating and enhancing LLMs' social intelligence in interactive settings. Theoretically, it advances the understanding of LLMs' capabilities and limitations in simulating human social behaviors.

Looking forward, further research could explore additional dimensions of social intelligence, such as self-correction and creativity, to create a more comprehensive evaluation framework. To address the high resource demands of such evaluations, future work might also delegate simpler tasks to smaller models while preserving rigor on the complex ones.

Conclusion

Liu et al.'s framework represents a significant step towards more nuanced and dynamic evaluations of LLMs' social intelligence. The findings elucidate the strengths and weaknesses of current models and lay the groundwork for future advancements in creating LLMs that more closely mirror human social capabilities.
