Self-Reflection in LLM Agents: Effects on Problem-Solving Performance (2405.06682v3)
Abstract: In this study, we investigated the effects of self-reflection in LLMs on problem-solving performance. We instructed nine popular LLMs to answer a series of multiple-choice questions to provide a performance baseline. For each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Then, using this guidance, each self-reflecting agent attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection ($p < 0.001$). In addition, we compared the various types of self-reflection to determine their individual contribution to performance. All code and data are available on GitHub at https://github.com/matthewrenze/self-reflection
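The procedure in the abstract (baseline answer, reflection on each incorrect answer, then a second attempt using that guidance) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Model` callable interface, the prompt wording, the helper names (`answer`, `reflect`, `run_experiment`), the `toy_model` stub, and the way accuracy is aggregated are all assumptions made for the example. The actual code is in the GitHub repository linked above.

```python
from typing import Callable, Dict, List

# Assumed interface: a "model" is any callable that maps a prompt to a text reply.
Model = Callable[[str], str]


def answer(model: Model, question: str, guidance: str = "") -> str:
    """Ask for a single-letter answer, optionally prefixed with reflection guidance."""
    prompt = f"Question: {question}\nAnswer with a single letter."
    if guidance:
        prompt = f"Guidance: {guidance}\n{prompt}"
    return model(prompt).strip().upper()[:1]


def reflect(model: Model, question: str, wrong_answer: str) -> str:
    """Ask the model to reflect on a mistake and write guidance for itself."""
    prompt = (
        f"Reflect on your mistake. You answered '{wrong_answer}' to the question "
        f"below, and that was incorrect. Write brief guidance for a second attempt.\n"
        f"Question: {question}"
    )
    return model(prompt)


def run_experiment(model: Model, items: List[Dict[str, str]]) -> Dict[str, float]:
    """Return accuracy before and after self-reflection on the missed questions."""
    baseline_correct = 0
    reflected_correct = 0
    for item in items:
        first = answer(model, item["question"])
        if first == item["answer"]:
            # Correct baseline answers are kept as-is (an assumption of this sketch).
            baseline_correct += 1
            reflected_correct += 1
            continue
        # Only incorrectly answered questions get a reflection and a retry.
        guidance = reflect(model, item["question"], first)
        second = answer(model, item["question"], guidance)
        if second == item["answer"]:
            reflected_correct += 1
    n = len(items)
    return {"baseline": baseline_correct / n, "reflected": reflected_correct / n}


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM, used only to make the sketch runnable."""
    if prompt.startswith("Reflect"):
        return "Re-read the options carefully before choosing."
    return "A" if "Guidance:" in prompt else "B"


ITEMS = [
    {"question": "2 + 2 = ?  A) 4  B) 5", "answer": "A"},
    {"question": "Capital of France?  A) Rome  B) Paris", "answer": "B"},
]

if __name__ == "__main__":
    print(run_experiment(toy_model, ITEMS))
```

Swapping `toy_model` for a function that calls a real LLM API leaves the loop unchanged. Comparing the two accuracies on the same questions is a paired design, so a paired significance test would be the natural follow-up.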