
Reflexion: Language Agents with Verbal Reinforcement Learning

(arXiv:2303.11366)
Published Mar 20, 2023 in cs.AI, cs.CL, and cs.LG

Abstract

LLMs have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.

Figure: Reflexion and ReAct improve search and reasoning on HotPotQA questions compared to chain-of-thought (CoT) methods.

Overview

  • Reflexion introduces a novel paradigm for LLM-based agents, using verbal reinforcement to improve their behavior.

  • The method has agents generate reflective feedback on their own performance and use it in future decision-making, mirroring how humans learn from experience.

  • It sidesteps the extensive data and fine-tuning requirements of traditional reinforcement learning (RL), making it computationally efficient.

  • Experiments show Reflexion outperforming strong baselines, including GPT-4, on coding, decision-making, and reasoning tasks.

Introduction

Reflexion introduces a new paradigm for enhancing language agents: verbal reinforcement. Rather than relying on the extensive training data and model fine-tuning that traditional reinforcement learning techniques require, it improves agents through linguistic feedback.

The Essence of Reflexion

Reflexion stands out by allowing agents to internally generate reflective textual feedback based on their performance in various tasks. This reflective feedback is then stored in an episodic memory, enabling the agent to make more informed decisions in future attempts. This process mirrors human learning patterns where reflection on past experiences leads to improved future actions. Remarkably, Reflexion is versatile, accommodating different types of feedback signals and sources, whether they are external or internally simulated.
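
To make the cycle concrete, below is a minimal sketch of the Reflexion loop in Python. The `llm`, `run_episode`, and `evaluate` callables are placeholders for a model completion function, a task rollout, and a task-specific evaluator; this is an illustration of the idea under those assumptions, not the paper's reference implementation.

```python
def reflexion_loop(task: str, llm, run_episode, evaluate, max_trials: int = 5):
    """Minimal sketch of Reflexion: act, evaluate, reflect, retry."""
    memory: list[str] = []  # episodic buffer of verbal self-reflections

    for _ in range(max_trials):
        # Actor: attempt the task, conditioned on lessons from past trials.
        prompt = task
        if memory:
            prompt += "\n\nLessons from past attempts:\n" + "\n".join(memory)
        trajectory = run_episode(llm, prompt)

        # Evaluator: reduce the outcome to a feedback signal; this can be
        # a scalar reward, test results, or free-form language.
        success, feedback = evaluate(trajectory)
        if success:
            return trajectory

        # Self-reflection: convert the feedback into a verbal lesson and
        # store it for the next trial; no model weights are updated.
        memory.append(llm(
            f"You attempted:\n{trajectory}\nFeedback: {feedback}\n"
            "In two sentences, explain what went wrong and what to do differently."
        ))

    return None  # unsolved within the trial budget
```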

Comparative Advantages

Traditional reinforcement learning (RL) methods, though effective, come with challenges, including substantial computational cost and the difficulty of accurate credit assignment from scalar or vector rewards. Reflexion addresses these challenges by:

  • Being computationally efficient as it doesn't necessitate fine-tuning of the LLM.
  • Offering a nuanced feedback system that transcends basic scalar or vector rewards, thus providing more targeted action adjustments.
  • Enabling a more explicit and interpretable episodic memory of prior experiences (see the sketch after this list).
  • Furnishing more explicit action hints for future episodes.
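
The paper keeps this episodic memory deliberately small, capping the number of stored reflections (typically one to three) so they fit in the model's context window. A bounded buffer like the hypothetical helper below captures that behavior.

```python
from collections import deque

class EpisodicMemory:
    """Bounded buffer of verbal reflections (hypothetical helper)."""

    def __init__(self, max_reflections: int = 3):
        # A deque with maxlen silently drops the oldest lesson once full,
        # mirroring the paper's cap on stored experiences.
        self.buffer: deque[str] = deque(maxlen=max_reflections)

    def add(self, reflection: str) -> None:
        self.buffer.append(reflection)

    def as_context(self) -> str:
        # Rendered as plain text so it can be prepended to the next prompt.
        return "\n".join(f"- {r}" for r in self.buffer)
```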

Empirical Evidence

The effectiveness of Reflexion is underscored by its performance across a spectrum of tasks, including sequential decision-making, reasoning, and programming. Notably, it achieved 91% pass@1 accuracy on the HumanEval coding benchmark, outperforming the previous GPT-4 state of the art of 80%, an 11-point absolute improvement that highlights Reflexion's potential to redefine benchmarks in generative AI tasks.
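
In the paper's programming setup, the agent also writes its own unit tests from the task specification; failing tests become the feedback that drives the next reflection. The sketch below illustrates that loop under the same assumptions as before (`llm` is a placeholder completion function, and the bare `exec` harness stands in for a proper sandbox).

```python
def solve_coding_task(spec: str, llm, max_trials: int = 4):
    """Sketch of Reflexion for code: generate, self-test, reflect, retry."""
    reflections: list[str] = []
    code = ""
    # Have the model write its own unit tests from the spec.
    tests = llm(f"Write Python assert-based unit tests for this spec:\n{spec}")

    for _ in range(max_trials):
        hints = "\nLessons from failed attempts:\n" + "\n".join(reflections) if reflections else ""
        code = llm(f"Implement this spec in Python:\n{spec}{hints}")
        try:
            namespace: dict = {}
            exec(code + "\n" + tests, namespace)  # run the candidate against the self-tests
            return code  # all asserts passed
        except Exception as err:
            # Turn the concrete failure into a verbal reflection.
            reflections.append(llm(
                f"This code:\n{code}\nfailed with: {err}\n"
                "In one sentence, state the likely bug and a concrete fix."
            ))

    return code  # best effort after exhausting the trial budget
```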

Experimental Insights

Applied to the AlfWorld suite and HotPotQA, Reflexion boosted agent performance by up to 22% and 20% respectively over traditional approaches. These experiments underline Reflexion's proficiency not only in interpreting the task at hand but also in leveraging past experiences to improve future attempts. In programming tasks, Reflexion set new benchmarks in code generation accuracy and demonstrated language-agnostic capability, with promising implications for a wide range of programming languages.

Limitations and Future Directions

While Reflexion introduces a groundbreaking approach to learning from linguistic feedback, it is essential to acknowledge its limitations. Restricting episodic memory to a fixed size may not capture the depth of experience needed for complex decision-making. Future work could expand the memory mechanism and explore more sophisticated models that cover a broader spectrum of learning strategies, mirroring human cognitive processes more closely.

Conclusion

Reflexion represents a significant leap forward in the development of intelligent language agents, offering a novel and effective approach to learning through verbal reinforcement. By enabling agents to self-reflect and learn from their experiences, Reflexion is poised to significantly advance the capabilities of generative AI, pushing the boundaries of what is possible in autonomous decision-making and reasoning tasks.

References
  1. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
  2. Program Synthesis with Large Language Models
  3. Large Language Models can Implement Policy Iteration
  4. MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation
  5. CodeT: Code Generation with Generated Tests
  6. Evaluating Large Language Models Trained on Code
  7. Teaching Large Language Models to Self-Debug
  8. TextWorld: A learning environment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers 7, pages 41–75. Springer.
  9. Goodman, N. (2023). Meta-prompt: A simple self-improving language agent. noahgoodman.substack.com.
  10. Language Models can Solve Computer Tasks
  11. A large-scale longitudinal study of flaky tests. Proc. ACM Program. Lang., 4(OOPSLA).
  12. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328.
  13. StarCoder: may the source be with you!
  14. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
  15. Self-Refine: Iterative Refinement with Self-Feedback
  16. DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
  17. WebGPT: Browser-assisted question-answering with human feedback
  18. OpenAI (2023). GPT-4 Technical Report. ArXiv.
  19. Generative Agents: Interactive Simulacra of Human Behavior
  20. REFINER: Reasoning Feedback on Intermediate Representations
  21. Automatic Prompt Optimization with "Gradient Descent" and Beam Search
  22. Toolformer: Language Models Can Teach Themselves to Use Tools
  23. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  24. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR).
  25. Reinforcement Learning: An Introduction. The MIT Press, second edition.
  26. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  27. Self-Evaluation Guided Beam Search for Reasoning
  28. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
  29. WebShop: Towards scalable real-world web interaction with grounded language agents. ArXiv.
  30. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).
  31. Answering Questions by Meta-Reasoning over Multiple Chains of Thought
