Curiosity-driven Exploration by Self-supervised Prediction

Published 15 May 2017 in cs.LG, cs.AI, cs.CV, cs.RO, and stat.ML | (1705.05363v1)

Abstract: In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch. Demo video and code available at https://pathak22.github.io/noreward-rl/



Summary

The paper presents a novel intrinsic reward mechanism that uses prediction error in a learned feature space to guide exploration in RL.
It employs a self-supervised inverse dynamics model and forward dynamics prediction to overcome challenges in high-dimensional, stochastic environments.
Experiments in VizDoom and Super Mario Bros demonstrate enhanced exploration, generalization, and unsupervised skill discovery compared to traditional A3C baselines.


This paper addresses reinforcement learning (RL) in environments where extrinsic rewards are sparse or absent. The authors propose using curiosity as an intrinsic reward signal to guide exploration. RL conventionally relies on extrinsic rewards for policy updates, but in many real-world scenarios these rewards are too sparse to learn from, so an alternative learning signal is needed. The authors define curiosity as the agent's prediction error within a self-supervised framework.

Methodology

The proposed method formulates curiosity as the agent's error in predicting the outcome of its own actions. The prediction is made not in raw pixel space but in a feature space learned by a self-supervised inverse dynamics model. Because that model only needs features that help it infer the agent's action, the feature space captures the aspects of the environment that the agent can affect or that affect the agent, and ignores the rest. Avoiding direct pixel prediction sidesteps the main difficulty of high-dimensional continuous state spaces such as images, and makes the curiosity signal less sensitive to variation in the environment that is unrelated to the agent.
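The formulation can be written compactly. The sketch below uses the notation of the original paper as best it can be reconstructed, and none of these symbols appear in this summary: \phi is the learned feature encoder, f is the forward model with parameters \theta_F, \eta > 0 is a scaling factor, and r^e_t is the (possibly zero) extrinsic reward.

```latex
\begin{aligned}
\hat{\phi}(s_{t+1}) &= f\big(\phi(s_t),\, a_t;\ \theta_F\big) \\
r^i_t &= \frac{\eta}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert_2^2 \\
r_t &= r^i_t + r^e_t
\end{aligned}
```

The intrinsic reward r^i_t is large exactly when the agent's prediction of the next feature vector is poor, so the policy is drawn towards transitions it cannot yet predict.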

The primary components of the methodology include:

Learning Feature Space: Using self-supervised learning, a neural network (the inverse dynamics model) is trained to predict the action the agent took, given the current state and the next state. In doing so, the network learns a feature embedding that abstracts away aspects of the environment irrelevant to the agent's actions.
Forward Dynamics Model: A second model is trained to predict the next state in the learned feature space from the current state's features and the chosen action. The intrinsic reward is the prediction error of this forward model, so the agent is rewarded for reaching transitions it cannot yet predict.
Policy Optimization: The asynchronous advantage actor-critic (A3C) algorithm is used to optimize the policy on the sum of the intrinsic curiosity reward and any infrequent extrinsic reward.
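The three components above can be sketched in a few lines of plain Python. This is a minimal, untrained illustration and not the authors' implementation: the class name `CuriositySketch`, the linear maps, the `tanh` encoder, and the parameter `eta` are all assumptions made for the example, whereas the paper uses convolutional networks trained jointly with A3C. No gradient updates are shown; the code only computes the two losses and the reward for a single transition.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


class CuriositySketch:
    """Hypothetical toy stand-in for the curiosity module (linear maps only)."""

    def __init__(self, obs_dim, feat_dim, n_actions, eta=1.0, seed=0):
        rng = random.Random(seed)
        self.n_actions = n_actions
        self.eta = eta  # scaling factor for the intrinsic reward (assumed value)
        self.W_enc = random_matrix(feat_dim, obs_dim, rng)               # encoder
        self.W_inv = random_matrix(n_actions, 2 * feat_dim, rng)         # inverse model
        self.W_fwd = random_matrix(feat_dim, feat_dim + n_actions, rng)  # forward model

    def encode(self, obs):
        """Map an observation to the feature space."""
        return [math.tanh(v) for v in matvec(self.W_enc, obs)]

    def inverse_loss(self, obs, next_obs, action):
        """Cross-entropy of predicting the action from both feature vectors."""
        logits = matvec(self.W_inv, self.encode(obs) + self.encode(next_obs))
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        return log_z - logits[action]

    def forward_error(self, obs, next_obs, action):
        """Half the squared error of the predicted next feature vector."""
        inputs = self.encode(obs) + one_hot(action, self.n_actions)
        pred = matvec(self.W_fwd, inputs)
        target = self.encode(next_obs)
        return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))

    def intrinsic_reward(self, obs, next_obs, action):
        """Curiosity reward: scaled forward-model prediction error."""
        return self.eta * self.forward_error(obs, next_obs, action)


def total_reward(extrinsic, intrinsic):
    """Reward passed to the policy optimizer (A3C in the paper)."""
    return extrinsic + intrinsic


# Usage on one made-up transition with 8-dimensional observations.
icm = CuriositySketch(obs_dim=8, feat_dim=4, n_actions=3)
rng = random.Random(1)
s_t = [rng.random() for _ in range(8)]
s_next = [rng.random() for _ in range(8)]
r_i = icm.intrinsic_reward(s_t, s_next, action=2)
print("intrinsic reward:", r_i)
print("reward with no extrinsic signal:", total_reward(0.0, r_i))
```

In a full system the inverse loss would train the encoder, the forward error would train the forward model, and the intrinsic reward would be added to the environment reward at every step before the policy update.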

Experimental Setup and Results

The authors evaluated their approach in two distinct environments: VizDoom, a 3D navigation task, and Super Mario Bros, a side-scrolling game. They considered three settings:

Sparse Extrinsic Reward: In VizDoom, the agent must navigate a complex 3D map to reach a goal, and reaching the goal is the only source of reward. The curiosity-driven agent reached the goal with fewer environment interactions than the baseline A3C agent, even when rewards were very sparse, indicating better exploration.
No Extrinsic Reward: The agent was trained with curiosity as its only reward signal. It still learned to explore effectively, covering large portions of the environment and discovering behaviors such as avoiding obstacles in Mario.
Generalization to Novel Scenarios: The authors tested whether exploratory behavior learned in one environment transfers to unseen ones. In VizDoom, an agent pre-trained purely for exploration performed better than training from scratch when fine-tuned on novel maps with different textures. Similarly, in Mario, an agent pre-trained on Level-1 transferred its knowledge to subsequent levels and outperformed one trained from scratch.

Implications and Future Directions

The implications of this work are manifold:

Scalability: The proposed method scales effectively to environments where traditional pixel-based prediction fails, making it applicable in real-world RL scenarios with complex, high-dimensional observations.
Robustness: By modeling only the aspects of the environment that the agent can affect or that affect the agent, the method is less sensitive to nuisance factors, which helps keep learning stable in the presence of environmental stochasticity.
Unsupervised Skill Discovery: The ability to learn useful exploratory behaviors without explicit rewards suggests potential for unsupervised skill discovery, paving the way for autonomous agents capable of pre-training in generic environments before task-specific fine-tuning.
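The robustness point can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not an experiment from the paper: the observation is a pair (agent position, uncontrollable noise), the forward predictor is hand-written and perfect for the position, and the feature map simply keeps the position. In the paper the features are learned by the inverse dynamics model rather than chosen by hand.

```python
import random


def step(obs, action, rng):
    """Toy dynamics: position moves by the action, noise is resampled."""
    position, _ = obs
    return (position + action, rng.gauss(0.0, 1.0))


def raw_error(obs, action, next_obs):
    """Prediction error in raw observation space.

    The best possible guess for the noise channel is its mean (0.0),
    so part of this error can never be reduced by learning.
    """
    pred = (obs[0] + action, 0.0)
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, next_obs))


def feature_error(obs, action, next_obs):
    """Prediction error in a feature space that keeps only the position."""
    pred = obs[0] + action
    return 0.5 * (pred - next_obs[0]) ** 2


rng = random.Random(0)
obs = (0.0, 0.0)
raw_total = 0.0
feat_total = 0.0
n_steps = 1000
for _ in range(n_steps):
    action = rng.choice([-1, 1])
    next_obs = step(obs, action, rng)
    raw_total += raw_error(obs, action, next_obs)
    feat_total += feature_error(obs, action, next_obs)
    obs = next_obs

print("mean raw-space error:", raw_total / n_steps)
print("mean feature-space error:", feat_total / n_steps)
```

In raw observation space the error stays high no matter how good the predictor is, so a curiosity reward based on it would keep rewarding the agent for watching noise. In the feature space the same predictor has nothing left to be curious about once the controllable dynamics are learned.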

Future research could explore the integration of this curiosity module with hierarchical RL frameworks, enhancing performance on more complex tasks by leveraging pre-learned behaviors as building blocks. Another promising direction is the application to transfer learning scenarios, where the agent can generalize its learned exploration strategies to entirely new domains, potentially reducing the need for extensive environmental sampling in each new task. Long-term, this line of work moves towards achieving autonomous agents that can independently explore and adapt to a wide array of real-world applications.
