Can large language models explore in-context?

(2403.15371)
Published Mar 22, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

We investigate the extent to which contemporary LLMs can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision making agents in complex settings.

Experiments compare GPT-4's performance with baseline algorithms on a bandit problem, highlighting exploration outcomes.

Overview

  • The paper investigates LLMs' exploration abilities in reinforcement learning, specifically through multi-armed bandit (MAB) problems without prior training adjustments.

  • Experiments were conducted with GPT-3.5, GPT-4, and Llama2, employing various prompt designs to pose MAB scenarios and analyze the models' responses.

  • The study revealed that most configurations showed exploration deficiencies; only GPT-4 with specific prompt designs succeeded at strategic exploration and learning.

  • The research underscores the importance of prompt design and potential algorithmic enhancements for LLMs' decision-making capabilities in exploration-demanding settings.

Exploring the Limits of Exploration: How LLMs Fare in Multi-Armed Bandit Environments

Introduction

The capacity for exploration underpins effective decision-making in complex environments. This study scrutinizes the inherent abilities of contemporary LLMs to engage in exploration, crucial for reinforcement learning (RL) and sequential decision making. By deploying LLMs as agents within multi-armed bandit (MAB) settings—without any training adjustments—the investigation uniquely positions LLMs in scenarios that demand exploration for successful navigation and learning.
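
To make the setting concrete, the sketch below implements the kind of Bernoulli bandit instance such experiments use: each arm returns a 0/1 reward with a hidden success probability. The number of arms, the reward gap, and the class name are illustrative assumptions rather than the paper's exact parameters.

```python
import random

class BernoulliBandit:
    """Minimal K-armed Bernoulli bandit: each arm pays reward 1 with a fixed
    hidden probability and 0 otherwise."""

    def __init__(self, means, seed=0):
        self.means = list(means)        # hidden success probability per arm
        self.rng = random.Random(seed)  # private RNG for reproducibility

    def pull(self, arm):
        """Sample a 0/1 reward for the chosen arm."""
        return 1 if self.rng.random() < self.means[arm] else 0

# Illustrative 5-armed instance with a single clearly best arm.
env = BernoulliBandit([0.5, 0.5, 0.5, 0.5, 0.75])
print(env.pull(4))
```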

Experimental Design

Given the emerging relevance of in-context learning, this study introduces a systematic examination of LLMs' exploration capabilities via simple yet foundational RL problems: multi-armed bandits. This choice is motivated by the simplicity and analytical tractability of MAB problems, which isolate the exploration-exploitation dilemma fundamental to decision making.

The research employs three LLMs: GPT-3.5, GPT-4, and Llama2, leveraging various prompt designs to pose the MAB scenario and gather responses. These models receive prompts that describe the bandit environment and query for the next action, giving rise to different experimental configurations based on variations in the prompt.
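
One way to realize this in-context setup is to serialize the environment description and the raw, unsummarized interaction history into a single text prompt, as in the sketch below. The function name and wording are hypothetical; the paper's exact prompt text is not reproduced here.

```python
def build_prompt(num_arms, history, chain_of_thought=False):
    """Render the environment description and raw interaction history as one
    text prompt. The wording is an assumption, not the paper's exact prompt."""
    lines = [
        f"You are choosing between {num_arms} slot machines (arms 0 to {num_arms - 1}).",
        "Each pull returns a reward of 0 or 1.",
        "Your goal is to maximize your total reward over many rounds.",
        "History of your pulls so far:",
    ]
    lines += [f"  round {t}: pulled arm {a}, received reward {r}"
              for t, (a, r) in enumerate(history)]
    if chain_of_thought:
        lines.append("Think step by step, then state which arm you pull next.")
    else:
        lines.append("State only which arm you pull next.")
    return "\n".join(lines)

# Illustrative raw history of (arm, reward) pairs.
print(build_prompt(5, [(0, 0), (1, 1), (2, 0)], chain_of_thought=True))
```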

The exploration behaviors of these LLMs are probed across multiple settings:

  • Environment Complexity: Easy and hard instances of MAB are chosen based on the number of arms and reward distribution complexities.
  • Temperature Settings: Sampling temperatures of zero and one are used to distinguish between intrinsic exploration and externally injected randomness.
  • Prompt Variations: Ranging from basic to advanced prompts, this study encompasses different scenarios, framings, summarization levels, and prompting for chain-of-thought reasoning (a sketch of one possible summarization scheme follows this list).
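
The summarization condition replaces the raw round-by-round history with per-arm sufficient statistics (pull counts and average rewards), which the abstract reports as important for the one successful configuration. Below is a minimal sketch of such an external summarizer; the text layout is an assumption.

```python
from collections import defaultdict

def summarize_history(history, num_arms):
    """Collapse raw (arm, reward) pairs into per-arm pull counts and average
    rewards -- sufficient statistics for a Bernoulli bandit."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward
    lines = []
    for arm in range(num_arms):
        if counts[arm]:
            lines.append(f"arm {arm}: pulled {counts[arm]} times, "
                         f"average reward {totals[arm] / counts[arm]:.2f}")
        else:
            lines.append(f"arm {arm}: never pulled")
    return "\n".join(lines)

print(summarize_history([(0, 0), (0, 1), (4, 1)], num_arms=5))
```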

Results and Findings

Across numerous experimental runs, a single successful configuration emerged: GPT-4 with a prompt that suggested exploration, presented a summarized interaction history, and elicited chain-of-thought reasoning. This configuration exhibited robust exploratory behavior, effectively identifying and exploiting the most rewarding actions in the stipulated bandit environment.

By contrast, the majority of configurations demonstrated significant exploration deficiencies, manifesting either as an undue focus on exploiting immediate rewards (akin to a greedy strategy) or as an almost uniform, undiscriminating choice distribution across all actions, indicative of a failure to learn from past interactions.

Specifically, configurations that did not employ summarized interaction histories, or that lacked prompt attributes explicitly encouraging exploration, were prone to these failures. Interestingly, the exploration success with GPT-4 also highlights the nuanced but critical role of prompt design in eliciting more sophisticated behaviors from LLMs.
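
For reference, the paper compares LLM agents against standard bandit baselines. The greedy rule that characterizes the failure mode above, and an upper-confidence-bound (UCB) style baseline that explores deliberately, can each be written in a few lines; the exploration constant and the exact baseline variants here are assumptions, not the paper's implementation.

```python
import math

def greedy_arm(counts, means):
    """Exploit only: pick the arm with the highest empirical mean reward.
    This mirrors the greedy failure mode described above."""
    return max(range(len(means)), key=lambda a: means[a])

def ucb_arm(counts, means, t, c=2.0):
    """UCB-style baseline: empirical mean plus a bonus that shrinks as an arm
    is pulled more often. The constant c is illustrative."""
    def score(a):
        if counts[a] == 0:
            return float("inf")   # always try an arm at least once
        return means[a] + math.sqrt(c * math.log(t + 1) / counts[a])
    return max(range(len(means)), key=score)

# Example: after 10 rounds, arm 1 looks best empirically but arm 2 is undersampled,
# so greedy picks arm 1 while UCB picks arm 2.
counts, means = [4, 5, 1], [0.25, 0.60, 0.00]
print(greedy_arm(counts, means), ucb_arm(counts, means, t=10))
```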

Implications and Future Directions

This investigation underlines the necessity of non-trivial prompt engineering or potential algorithmic interventions to unlock and elevate the decision-making capacities of LLMs in settings that demand robust exploration strategies. The findings prompt several lines of inquiry and development:

  • Further Prompt Exploration: Expanding the diversity and depth of prompts may uncover more nuanced aspects of LLM capabilities.
  • Algorithmic Interventions: Fine-tuning or custom training paradigms might be essential for cultivating sophisticated exploration behaviors in more complex RL environments.
  • Methodological Advances: Developing methodologies for cost-effective, large-scale evaluations of LLM behaviors in decision-making contexts is paramount.

Conclusion

While a single configuration demonstrated the potential for LLMs to engage in strategic exploration within a controlled environment, the overarching evidence points to a generalized struggle among LLMs to autonomously navigate the exploration-exploitation trade-off without explicit guidance. This study, while focusing on the elemental RL challenge of multi-armed bandits, lays foundational insights for developing LLMs into more adept agents for broader and more complex decision-making tasks.
