People cannot distinguish GPT-4 from a human in a Turing test (2405.08007v1)

Published 9 May 2024 in cs.HC and cs.AI

Abstract: We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.

Citations (17)

View on Semantic Scholar

Summary

The paper demonstrates that GPT-4 achieves a 54% pass rate in a Turing test, showcasing its ability to mimic human conversational nuances.
It employs a controlled experiment comparing GPT-4, GPT-3.5, and ELIZA using a 5-minute conversation framework to assess human indistinguishability.
The study highlights that linguistic style and socio-emotional cues are key factors in distinguishing human interactions from AI responses.

Evaluating GPT-4 in a Modern Turing Test

Background and Motivation

Artificial Intelligence has come a long way since Alan Turing's early paper on machine intelligence in 1950. One of the most compelling questions in AI is whether machines can convincingly mimic human behavior. This paper puts three systems—ELIZA, GPT-3.5, and GPT-4—to the test in a classic Turing test setting. Participants engaged in a 5-minute text conversation with either a human or an AI, aiming to discern the nature of their interlocutor.

Experimental Setup

AI Models Tested

The paper evaluated three AI systems:

GPT-4: A LLM known for its sophisticated text generation capabilities.
GPT-3.5: Another version of OpenAI's LLM, less advanced than GPT-4.
ELIZA: A much older, rule-based system known for mimicking a Rogerian psychotherapist.

GPT-4 and GPT-3.5 were loaded with a custom prompt to mimic a young person using slang and making occasional spelling errors. ELIZA served as a baseline to measure whether people can recognize simple chatbots.

The Turing Test Setup

Participants were randomly assigned roles as interrogators or witnesses. Human witnesses had to convince interrogators of their humanity, while AI witnesses attempted to pass as human. Conversations took place in a messaging interface, and interrogators reached a verdict after 5 minutes. The setup aimed to answer whether people can reliably distinguish AI from humans and figure out the factors influencing their decisions.

Key Results

How Did GPT-4 Perform?

Pass Rates:

GPT-4 was judged human 54% of the time.
GPT-3.5 had a pass rate of 50%.
ELIZA was identified as human only 22% of the time.
Human participants were correctly identified 67% of the time.

Statistical analysis confirmed that GPT-4's pass rate wasn't significantly different from random guessing (50%), implying it convincingly passed the test.

Human Identification:

Surprisingly, human participants' pass rate was not 100%. They were only correctly identified 67% of the time, which indicates that people's confidence in identifying humans isn't absolute, especially with advanced AI systems in play.

What Convinces Humans?

Strategies and Reasons

Participants used several strategies to identify their conversational partners:

Small Talk (36%): Asking personal questions or daily activities.
Social & Emotional Queries (25%): Questions on opinions, humor, experiences.
Knowledge & Reasoning (13%): Questions on math, logic, general knowledge.

Effective strategies for correct identification included:

Asking about human experiences (75% accuracy).
Engaging in logic and math discussions (78% accuracy).
Accusations of being an AI (80% accuracy).

When giving their reasons, interrogators cited:

Linguistic Style (43%): Spelling, grammar, tone.
Socio-emotional Factors (24%): Humor, personality.
Knowledge & Reasoning (10%): Adequate or inadequate knowledge.

Implications of the Research

Practical Concerns

The fact that GPT-4 could pass as human 54% of the time has practical implications:

Economic Roles: AI could take over jobs requiring text-based human interaction.
Deception and Trust: Misleading the public or being mistaken for a human could have social ramifications, affecting trust.

Theoretical Insights

The paper sheds light on what the Turing test measures today. It appears that socio-emotional intelligence and conversational style play a more significant role than sheer factual knowledge and logic. This might mean that the human "touch" in conversations—humor, personality, empathy—are harder to replicate convincingly with AI.

Future Directions

Future work could:

Explore Training: Investigate if educating people on AI characteristics helps in better identification.
Enhance Strategies: Develop and test new questioning techniques to reliably distinguish AI from humans.
Longitudinal Studies: As AI systems evolve, studying how human ability to recognize AI changes over time will be crucial.

Conclusion

This paper provides robust evidence that GPT-4 can pass a Turing test under controlled conditions, marking a significant moment in AI-human interaction research. While promising, the results also underscore the need for ongoing scrutiny and new strategies to manage the social and ethical dimensions of increasingly human-like AI.