A Survey on Transformers in Reinforcement Learning (2301.03044v3)
Published 8 Jan 2023 in cs.LG and cs.AI
Abstract: The Transformer has been considered the dominant neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge in the use of Transformers has appeared in the domain of reinforcement learning (RL), where it faces unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well charted. In this paper, we seek to systematically review the motivations for and progress on using Transformers in RL, provide a taxonomy of existing works, discuss each sub-field, and summarize future prospects.