Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents (2407.01887v4)
Abstract: In-Context Reinforcement Learning (ICRL) is a frontier paradigm for solving Reinforcement Learning (RL) problems in the foundation-model era. While ICRL capabilities have been demonstrated in transformers through task-specific training, the potential of out-of-the-box LLMs remains largely unexplored. This paper investigates whether LLMs can generalize across domains to perform ICRL on the problem of Dueling Bandits (DB), a stateless preference-based RL setting. We find that top-performing LLMs exhibit a notable zero-shot capacity for relative decision-making, which translates into low short-term weak regret across all DB environment instances by quickly including the best arm in duels. However, an optimality gap remains between LLMs and classic DB algorithms in terms of strong regret: LLMs struggle to converge and to exploit consistently, even when explicitly prompted to do so, and are sensitive to prompt variations. To bridge this gap, we propose an agentic flow framework, LLM with Enhanced Algorithmic Dueling (LEAD), which integrates off-the-shelf DB algorithm support with LLM agents through fine-grained adaptive interplay. We show that LEAD inherits theoretical guarantees from classic DB algorithms on both weak and strong regret, and we validate its efficacy and robustness even under noisy and adversarial prompts. The design of such an agentic framework sheds light on how to enhance the trustworthiness of general-purpose LLMs applied to in-context decision-making tasks.
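The distinction between weak regret (the best arm merely appears in the duel) and strong regret (both dueled arms are near-optimal) is central to the abstract's findings and is easy to see in code. Below is a minimal sketch of a dueling-bandit loop, assuming a Bradley-Terry preference model over five arms and a uniform-random duel policy standing in for the decision-maker; the utilities, the `duel` helper, and the policy are illustrative assumptions, not the paper's implementation.

```python
# Minimal dueling-bandit sketch (illustrative assumptions, not the paper's code).
# Setup: Bradley-Terry preferences over K arms; regret is measured against the
# Condorcet winner (the arm with the highest utility).
import random

random.seed(0)

K = 5
utilities = [0.9, 0.7, 0.5, 0.3, 0.1]
best = max(range(K), key=lambda i: utilities[i])  # Condorcet winner (arm 0 here)

def pref_prob(i, j):
    """Bradley-Terry probability that arm i beats arm j."""
    return utilities[i] / (utilities[i] + utilities[j])

def duel(i, j):
    """One noisy pairwise comparison; returns the winning arm."""
    return i if random.random() < pref_prob(i, j) else j

T = 1000
weak_regret = strong_regret = 0.0
for _ in range(T):
    # A real agent (an LLM or a classic DB algorithm) would choose the pair
    # adaptively; this sketch duels a uniformly random pair for illustration.
    i, j = random.sample(range(K), 2)
    duel(i, j)  # the outcome would inform an adaptive policy's next choice
    eps_i = pref_prob(best, i) - 0.5  # preference gap of arm i to the best arm
    eps_j = pref_prob(best, j) - 0.5
    weak_regret += min(eps_i, eps_j)        # zero whenever the best arm is dueled
    strong_regret += 0.5 * (eps_i + eps_j)  # zero only if best duels itself

print(f"weak regret after {T} rounds:   {weak_regret:.1f}")
print(f"strong regret after {T} rounds: {strong_regret:.1f}")
```

Note how the weak-regret term vanishes whenever the best arm occupies one of the two slots, while the strong-regret term only vanishes once the policy settles on dueling the best arm against itself: precisely the sustained exploitation that, per the abstract, plain LLM agents fail to reach and LEAD recovers through its classic-algorithm backbone.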