
Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning (2312.04736v1)

Published 7 Dec 2023 in cs.CL and cs.AI

Abstract: Despite numerous successes, the field of reinforcement learning (RL) remains far from matching the impressive generalisation power of human behaviour learning. One possible way to help bridge this gap may be to provide RL agents with richer, more human-like feedback expressed in natural language. To investigate this idea, we first extend BabyAI to automatically generate language feedback from the environment dynamics and goal condition success. Then, we modify the Decision Transformer architecture to take advantage of this additional signal. We find that training with language feedback either in place of or in addition to the return-to-go or goal descriptions improves agents' generalisation performance, and that agents can benefit from feedback even when this is only available during training, but not at inference.
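The architectural change the abstract describes, feeding per-timestep language feedback alongside (or instead of) the return-to-go and goal tokens of a Decision Transformer, can be illustrated with a small sketch. The class and argument names below are hypothetical rather than the authors' code; the sketch assumes feedback sentences have already been encoded into fixed-size vectors (e.g. by a frozen sentence encoder) and simply adds them as an extra token per timestep.

```python
import torch
import torch.nn as nn

class FeedbackDecisionTransformer(nn.Module):
    """Illustrative sketch: a Decision-Transformer-style policy whose
    per-timestep token sequence is (feedback, return-to-go, state, action)
    instead of the usual (return-to-go, state, action)."""

    def __init__(self, state_dim, act_dim, feedback_dim, hidden_dim=128,
                 n_layers=3, n_heads=4, max_timesteps=1024):
        super().__init__()
        self.embed_state = nn.Linear(state_dim, hidden_dim)
        self.embed_action = nn.Linear(act_dim, hidden_dim)
        self.embed_return = nn.Linear(1, hidden_dim)
        # Projection of the pre-computed feedback sentence embedding
        # to the model width (assumed input, not part of this module).
        self.embed_feedback = nn.Linear(feedback_dim, hidden_dim)
        self.embed_timestep = nn.Embedding(max_timesteps, hidden_dim)

        layer = nn.TransformerEncoderLayer(hidden_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(hidden_dim, act_dim)

    def forward(self, states, actions, returns_to_go, feedback, timesteps):
        # states:        (B, T, state_dim)
        # actions:       (B, T, act_dim)
        # returns_to_go: (B, T, 1)
        # feedback:      (B, T, feedback_dim) sentence embeddings
        # timesteps:     (B, T) integer step indices
        B, T = states.shape[:2]
        t_emb = self.embed_timestep(timesteps)

        # Interleave four tokens per timestep: feedback, return, state, action.
        tokens = torch.stack([
            self.embed_feedback(feedback) + t_emb,
            self.embed_return(returns_to_go) + t_emb,
            self.embed_state(states) + t_emb,
            self.embed_action(actions) + t_emb,
        ], dim=2).reshape(B, 4 * T, -1)

        # Causal mask so each token only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(4 * T).to(states.device)
        h = self.transformer(tokens, mask=mask)

        # Predict the next action from the hidden state at each state token.
        state_hidden = h.reshape(B, T, 4, -1)[:, :, 2]
        return self.predict_action(state_hidden)
```

Under these assumptions, dropping the return-to-go token (or the goal token, if one is included) while keeping the feedback token corresponds to the "in place of" variants the abstract mentions, and keeping all tokens corresponds to the "in addition to" variants.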

Citations (1)
