Reinforcement Learning with Action-Free Pre-Training from Videos

Published 25 Mar 2022 in cs.CV and cs.AI | (2203.13880v2)

Abstract: Recent unsupervised pre-training methods have shown to be effective on language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate if such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL). To this end, we introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations for efficiently learning action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model. Moreover, for better exploration, we propose a video-based intrinsic bonus that leverages pre-trained representations. We demonstrate that our framework significantly improves both final performances and sample-efficiency of vision-based RL in a variety of manipulation and locomotion tasks. Code is available at https://github.com/younggyoseo/apv.

Summary

The paper proposes a novel two-phase framework using action-free videos for unsupervised pre-training to improve vision-based reinforcement learning.
The approach involves pre-training a latent video prediction model without action information, then fine-tuning with an action-conditional model and an intrinsic bonus for exploration.
Experiments show significant performance gains and improved sample efficiency on Meta-world and DeepMind Control Suite tasks by transferring representations learned from diverse video datasets.

The paper "Reinforcement Learning with Action-Free Pre-Training from Videos" presents an approach to improving the sample efficiency and performance of vision-based reinforcement learning (RL) agents through unsupervised pre-training on videos from diverse domains. The method pre-trains a model on action-free videos and then fine-tunes it on specific RL tasks, carrying over to RL the unsupervised pre-training recipe that has worked in computer vision (CV) and natural language processing (NLP).

Framework Overview

The proposed framework has two phases:

Action-Free Pre-Training: The first phase trains a latent video prediction model without any action information. This unsupervised pre-training captures the dynamics present in the videos and needs neither a labeled dataset nor the action annotations that RL data normally carries. The model encodes observations into latent states and predicts future states in that latent space rather than in pixel space, which keeps prediction computationally efficient.
Fine-Tuning with Action-Conditional Model: Once the model is pre-trained, a novel architecture stacks an action-conditional latent prediction model on top of the pre-trained model. This approach transfers the learned representations to downstream RL tasks by incorporating action information during fine-tuning. A video-based intrinsic bonus for exploration is also introduced, utilizing the pre-trained representations to encourage the agent to explore diverse behaviors.
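The two phases above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the layer sizes, the random (untrained) weights, the deterministic tanh layers, and the helper names (init_linear, layer, rollout, knn_intrinsic_bonus) are all hypothetical. The intrinsic bonus is written here as the distance to the k-th nearest neighbour among average-pooled action-free representations, which is one plausible reading of a "video-based" bonus; the window size and k are illustrative values.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, Z_DIM, S_DIM, ACT_DIM = 6, 4, 4, 2


def init_linear(in_dim, out_dim):
    """Random weights standing in for a trained network layer."""
    return [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def layer(weights, inputs):
    """One dense layer with a tanh non-linearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


# Action-free layer: the part that would be pre-trained on videos. It never sees actions.
W_FREE = init_linear(Z_DIM + OBS_DIM, Z_DIM)
# Action-conditional layer: stacked on top and introduced only at fine-tuning time.
W_ACT = init_linear(S_DIM + ACT_DIM + Z_DIM, S_DIM)


def rollout(observations, actions):
    """Run both layers over a trajectory and return (action-free, stacked) latents."""
    z, s = [0.0] * Z_DIM, [0.0] * S_DIM
    z_seq, s_seq = [], []
    for obs, act in zip(observations, actions):
        z = layer(W_FREE, z + obs)      # depends on past latent and frame only
        s = layer(W_ACT, s + act + z)   # additionally conditions on the action
        z_seq.append(z)
        s_seq.append(s)
    return z_seq, s_seq


def knn_intrinsic_bonus(z_seq, window=3, k=2):
    """Distance from each windowed, average-pooled latent to its k-th nearest neighbour."""
    pooled = []
    for t in range(len(z_seq)):
        chunk = z_seq[max(0, t - window + 1): t + 1]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    bonus = []
    for p in pooled:
        # Index 0 of the sorted list is the zero self-distance, so index k is the k-th neighbour.
        dists = sorted(math.dist(p, q) for q in pooled)
        bonus.append(dists[k])
    return bonus


T = 10
observations = [[random.gauss(0.0, 1.0) for _ in range(OBS_DIM)] for _ in range(T)]
actions = [[random.gauss(0.0, 1.0) for _ in range(ACT_DIM)] for _ in range(T)]
z_seq, s_seq = rollout(observations, actions)
bonus = knn_intrinsic_bonus(z_seq)
print(len(z_seq), len(s_seq), len(bonus))
```

In this sketch the action-free latents are computed without reference to the actions, so the lower layer could be trained on videos alone, while the stacked layer is the only place where actions enter. In an agent, a bonus of this kind would typically be added to the task reward with a scaling coefficient; that coefficient is not specified in this summary.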

Experimental Results

The framework is evaluated on two families of tasks:

Meta-world Manipulation Tasks: Using videos from RLBench for pre-training, the agent shows significant improvements over existing methods like DreamerV2, with a notable increase in success rates across multiple diverse tasks.
DeepMind Control Suite: Pre-training on manipulation videos, a different domain from the locomotion tasks used for fine-tuning, still yields considerable performance gains, which suggests that the learned representations transfer across task domains.

Contributions and Implications

The paper makes several contributions to reinforcement learning:

Efficient Representation Transfer: By utilizing action-free videos for pre-training, the approach efficiently transfers learned representations to novel tasks, enhancing the sample efficiency of RL agents.
Scalability and Domain Independence: The ability to pre-train on diverse datasets without domain-specific action labels highlights the method's scalability and potential applicability across various autonomous systems.
Future Directions in AI: This framework provides a promising direction for future research, possibly integrating more complex video datasets, incorporating advanced video prediction models, and exploring other pre-training objectives such as masked prediction or contrastive learning.

Conclusion

The paper explores action-free video pre-training as a way to strengthen RL systems. By showing that action-free pre-trained models transfer to vision-based RL and improve it, the work opens the way to autonomous learning systems that draw on diverse, unstructured data sources. Future work could scale up the pre-trained models, incorporate diverse real-world datasets, and develop more sophisticated pre-training frameworks that bring RL and unsupervised representation learning closer together.
