Reflexion: Language Agents with Verbal Reinforcement Learning (2303.11366v4)
Abstract: Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to learn quickly and efficiently from trial and error, as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state of the art, GPT-4, which achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
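The trial-reflect-retry loop described in the abstract can be sketched as plain control flow: an actor produces an attempt, an evaluator returns success plus feedback, a self-reflection step turns that feedback into verbal guidance stored in an episodic memory buffer, and the memory conditions the next trial. This is a minimal illustration only, not the paper's implementation; `reflexion_loop`, the callable signatures, and the toy stubs below are all hypothetical names chosen for the sketch (in the paper, the actor, evaluator, and self-reflection components are themselves LLM calls).

```python
from typing import Callable, List, Tuple

def reflexion_loop(
    actor: Callable[[str, List[str]], str],          # (task, reflections) -> attempt
    evaluator: Callable[[str], Tuple[bool, str]],    # attempt -> (success, feedback)
    reflect: Callable[[str, str], str],              # (attempt, feedback) -> reflection
    task: str,
    max_trials: int = 3,
) -> Tuple[str, int]:
    """Run up to max_trials attempts, reinforcing via verbal reflections
    instead of weight updates. Returns the final attempt and trial count."""
    memory: List[str] = []  # episodic memory buffer of reflective text
    attempt = ""
    for trial in range(max_trials):
        # Actor conditions on the task plus all prior reflections.
        attempt = actor(task, memory)
        success, feedback = evaluator(attempt)
        if success:
            return attempt, trial + 1
        # Convert the raw feedback signal into verbal self-reflection
        # and persist it for subsequent trials.
        memory.append(reflect(attempt, feedback))
    return attempt, max_trials

# Toy stubs standing in for LLM calls: the actor succeeds only once
# a reflection from a failed trial is available in memory.
def toy_actor(task: str, memory: List[str]) -> str:
    return "fixed solution" if memory else "buggy solution"

def toy_evaluator(attempt: str) -> Tuple[bool, str]:
    return attempt == "fixed solution", "unit test failed"

def toy_reflect(attempt: str, feedback: str) -> str:
    return f"'{attempt}' failed ({feedback}); try a different approach."

result, trials = reflexion_loop(toy_actor, toy_evaluator, toy_reflect, "demo task")
print(result, trials)  # the second trial succeeds after one reflection
```

The key design point the sketch captures is that learning happens entirely in the memory buffer: the "policy update" is appending text, which is why the framework can accept scalar, free-form, external, or internally simulated feedback interchangeably.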