Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Published 6 Mar 2017 in cs.LG | (1703.01732v1)

Abstract: Exploration in complex domains is a key challenge in reinforcement learning, especially for tasks with very sparse rewards. Recent successes in deep reinforcement learning have been achieved mostly using simple heuristic exploration strategies such as $\epsilon$-greedy action selection or Gaussian control noise, but there are many tasks where these methods are insufficient to make any learning progress. Here, we consider more complex heuristics: efficient and scalable exploration strategies that maximize a notion of an agent's surprise about its experiences via intrinsic motivation. We propose to learn a model of the MDP transition probabilities concurrently with the policy, and to form intrinsic rewards that approximate the KL-divergence of the true transition probabilities from the learned model. One of our approximations results in using surprisal as intrinsic motivation, while the other gives the $k$-step learning progress. We show that our incentives enable agents to succeed in a wide range of environments with high-dimensional state spaces and very sparse rewards, including continuous control tasks and games in the Atari RAM domain, outperforming several other heuristic exploration techniques.

Summary

The paper introduces a surprise-driven intrinsic reward framework that enhances exploration in environments with sparse rewards.
It quantifies surprise using the KL divergence between true and estimated dynamics to guide better agent exploration.
Empirical results across continuous control and Atari RAM tasks show that it outperforms several other heuristic exploration techniques, and that the surprisal bonus is cheaper to compute than VIME.

An Analysis of Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

The paper "Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning" by Joshua Achiam and Shankar Sastry addresses a longstanding challenge in reinforcement learning (RL): exploring effectively when rewards are sparse. Recent successes in deep RL have relied mostly on simple heuristic strategies such as $\epsilon$-greedy action selection or Gaussian control noise, yet these often fail to make any learning progress when rewards are spread thinly across large state spaces. The paper proposes exploration strategies that use intrinsic motivation grounded in surprise to address this problem.

The authors develop a framework in which the RL agent's exploration is driven by surprise, quantified as the Kullback-Leibler (KL) divergence of the true transition probabilities of the Markov Decision Process (MDP) from a model that is learned concurrently with the policy. Because the true transition probabilities are unknown, two approximations of the resulting intrinsic reward are proposed: 1) surprisal, the negative log-probability that the learned model assigns to an observed transition, and 2) $k$-step learning progress, which aligns more closely with a Bayesian concept of surprise.
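As a sketch of how these quantities relate, write $P$ for the true transition probabilities, $P_\phi$ for the learned model, $r$ for the extrinsic reward, and $\eta > 0$ for a bonus weight. This notation is assumed here for illustration and is not quoted from the summary above. A reward whose intrinsic part has the KL divergence of $P$ from $P_\phi$ as its expectation over $s' \sim P$ could take the form:

$$
r'(s, a, s') = r(s, a, s') + \eta \left( \log P(s' \mid s, a) - \log P_\phi(s' \mid s, a) \right)
$$

Dropping the unknown $\log P$ term leaves a surprisal bonus:

$$
r'(s, a, s') = r(s, a, s') - \eta \log P_\phi(s' \mid s, a)
$$

Alternatively, letting the current model $P_{\phi_t}$ stand in for $P$ and the model from $k$ updates earlier stand in for $P_\phi$ gives a $k$-step learning progress bonus:

$$
r'(s, a, s') = r(s, a, s') + \eta \left( \log P_{\phi_t}(s' \mid s, a) - \log P_{\phi_{t-k}}(s' \mid s, a) \right)
$$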

Empirical results demonstrate the efficacy of the proposed methods in environments with high-dimensional state spaces and sparse rewards. The authors report that agents trained with surprise-based intrinsic motivation outperform several other heuristic exploration techniques across benchmarks that include continuous control tasks and games in the Atari RAM domain. The surprisal bonus is especially noteworthy: it performs consistently across tasks at lower computational cost than Variational Information Maximizing Exploration (VIME).

Theoretical and Computational Considerations

The paper builds on the notion that an effective exploration strategy must balance exploration against exploitation, and that this balance depends on how the agent treats unfamiliar states. By defining intrinsic rewards in terms of surprise, the authors give the agent an incentive to explore that is independent of the extrinsic rewards supplied by the environment.
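To make the idea of an independent incentive concrete, the following minimal sketch adds each bonus to a sparse extrinsic reward. The function names, the weight `eta`, and the log-probability values are hypothetical and chosen only for illustration; in practice the log-probabilities would come from a learned dynamics model.

```python
import numpy as np

def surprisal_bonus(log_p_model):
    """Surprisal: negative log-probability the model gave each observed transition."""
    return -np.asarray(log_p_model, dtype=float)

def learning_progress_bonus(log_p_current, log_p_older):
    """k-step learning progress: log-likelihood gain of the current model
    over a snapshot of the model taken k updates earlier."""
    return np.asarray(log_p_current, dtype=float) - np.asarray(log_p_older, dtype=float)

def shaped_rewards(extrinsic, bonus, eta=0.01):
    """Add the scaled intrinsic bonus to the environment's reward."""
    return np.asarray(extrinsic, dtype=float) + eta * np.asarray(bonus, dtype=float)

# Sparse-reward trajectory: the environment gives no signal for four steps.
extrinsic = np.zeros(4)
# Made-up log-probabilities; the third transition is poorly predicted.
log_p_now = np.array([-0.5, -0.4, -3.0, -0.6])
log_p_old = np.array([-0.9, -0.5, -3.1, -0.6])

r_surprisal = shaped_rewards(extrinsic, surprisal_bonus(log_p_now), eta=0.1)
r_progress = shaped_rewards(
    extrinsic, learning_progress_bonus(log_p_now, log_p_old), eta=0.1
)
```

Even though every extrinsic reward is zero, both shaped reward vectors are non-trivial: the surprisal variant favors the poorly predicted third transition, while the learning progress variant favors the first transition, where the model improved most.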

The approach is also scalable, with less computational overhead than methods like VIME. Computing the surprisal bonus requires only forward passes through the dynamics model, whereas VIME requires both forward and backward passes through a Bayesian neural network to compute its rewards.
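The forward-pass-only property can be illustrated with a small sketch. It assumes a diagonal Gaussian dynamics model whose mean comes from a one-hidden-layer network; the architecture, parameter names, and initialization below are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def init_model(state_dim, action_dim, hidden=32, seed=0):
    """Randomly initialized parameters for a toy Gaussian dynamics model."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + action_dim
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, state_dim)),
        "b2": np.zeros(state_dim),
        "log_std": np.zeros(state_dim),  # unit variance per state dimension
    }

def predict_mean(params, s, a):
    """Forward pass: predicted mean of the next state given (s, a)."""
    x = np.concatenate([s, a], axis=-1)
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def surprisal(params, s, a, s_next):
    """Negative log-density of s_next under the model; no gradients are needed."""
    mean = predict_mean(params, s, a)
    log_std = params["log_std"]
    z = (s_next - mean) / np.exp(log_std)
    log_p = -0.5 * np.sum(z**2 + 2.0 * log_std + np.log(2.0 * np.pi), axis=-1)
    return -log_p

params = init_model(state_dim=3, action_dim=2)
rng = np.random.default_rng(1)
s = rng.normal(size=(5, 3))
a = rng.normal(size=(5, 2))

expected_next = predict_mean(params, s, a)
bonus_expected = surprisal(params, s, a, expected_next)          # perfectly predicted
bonus_surprising = surprisal(params, s, a, expected_next + 3.0)  # far from prediction
```

A transition that lands exactly on the predicted mean receives the minimum possible bonus for this model, and the bonus grows as the observed next state moves away from the prediction. Computing the learning progress bonus in the same style would mean evaluating the log-density under two parameter sets, the current one and an older snapshot.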

Practical Implications and Future Directions

Practically, surprise-based intrinsic motivation offers a general-purpose exploration bonus for sparse-reward problems. Its demonstrated flexibility and efficiency suggest potential applications ranging from robotics and autonomous systems to complex strategy games.

Theoretically, this approach opens avenues for further work on information-theoretic measures as intrinsic rewards. Future work could focus on refining dynamics models so that they better approximate the true environment dynamics, or on hybrid methods that combine multiple forms of intrinsic motivation. Further investigation might also examine the theoretical properties that enable these surprise-based mechanisms to succeed in environments with stochastic dynamics.

Achiam and Sastry's contributions mark a meaningful advance in exploration strategies for deep reinforcement learning and provide a foundation for future intrinsic motivation frameworks aimed at sparse-reward environments.
