- The paper introduces BDPI, an actor-critic algorithm that decouples actor training from critic training and uses multiple off-policy critics to boost sample efficiency.
- It employs Aggressive Bootstrapped Clipped DQN to stabilize critic updates and mitigate overestimation biases.
- Experimental results show that BDPI outperforms strong baselines such as Bootstrapped DQN, PPO, and ACKTR in exploration-intensive and sparse-reward environments, while requiring less hyper-parameter tuning.
Overview of Bootstrapped Dual Policy Iteration: Sample-Efficient Model-Free Reinforcement Learning
The paper "Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics" introduces Bootstrapped Dual Policy Iteration (BDPI), a novel reinforcement learning (RL) algorithm designed for tasks characterized by continuous states and discrete actions. The BDPI algorithm enhances sample efficiency by integrating multiple off-policy critics within an actor-critic framework. This approach decouples the learning of the actor and the critics, resulting in a substantial increase in both the stability and robustness of the algorithm against variations in hyper-parameters.
Contributions and Methodology
The central contribution of BDPI is an actor-critic architecture whose critics are trained fully off-policy. Conventional actor-critic methods rely on on-policy critics, which limits how well they can exploit experience replay, one of the main levers for sample efficiency. By dropping the requirement that critics be on-policy, BDPI can draw on a wider range of value-based methods to improve performance on RL tasks.
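As a reminder of the mechanism in question, an experience-replay buffer looks roughly like the sketch below. This is standard RL plumbing rather than code from the paper; off-policy critics can be trained on minibatches sampled from it regardless of which policy generated the data. The capacity and batch size here are arbitrary assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Generic experience-replay buffer (standard RL plumbing, not code from the paper)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Uniformly sample a minibatch of stored transitions for off-policy training.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```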
BDPI's methodology rests on two key components:
- Aggressive Bootstrapped Clipped DQN (ABCDQN): BDPI trains several off-policy critics with an algorithm inspired by Bootstrapped DQN and Clipped Double Q-Learning. Each critic maintains two value functions and is updated with a Q-Learning variant whose clipped bootstrap target keeps updates stable and limits overestimation bias; the aggressive, repeated updates make the critics highly sample-efficient, though prone to overfitting (see the critic-update sketch after this list).
- Actor Training with Off-Policy Critics: The actor update is inspired by Conservative Policy Iteration but diverges in key respects to accommodate off-policy critics. The actor learns by repeatedly moving its policy toward the greedy policies of the individual critics, which yields high-quality exploration akin to Thompson sampling (see the actor-update sketch below).
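Below is a minimal tabular sketch of the ABCDQN-style critic update described above. It is not the authors' implementation: the learning rate `alpha`, the discount `gamma`, the tabular value functions, and the interpretation of "aggressive" as repeated sweeps over the same replayed batch with the two value functions swapped between sweeps are all illustrative assumptions (the paper uses neural-network critics trained on replayed experience).

```python
import numpy as np

def clipped_q_update(qa, qb, s, a, r, s_next, done, alpha=0.2, gamma=0.99):
    """One Clipped-DQN-style update of critic table `qa`, using `qb` to clip the target.

    `qa`, `qb`: arrays of shape (n_states, n_actions), the critic's two value functions.
    The greedy next action is chosen with `qa`, but the bootstrap value is the
    minimum of both estimates, which limits overestimation.
    """
    a_star = int(np.argmax(qa[s_next]))                      # greedy next action under qa
    bootstrap = min(qa[s_next, a_star], qb[s_next, a_star])  # clipped value estimate
    target = r + (0.0 if done else gamma * bootstrap)
    qa[s, a] += alpha * (target - qa[s, a])                  # move qa toward the clipped target
    return qa


def abcdqn_critic_update(qa, qb, batch, n_iterations=4, **kw):
    """'Aggressive' variant: sweep the same replayed batch several times,
    swapping the roles of the two value functions between sweeps
    (an illustrative interpretation, see the lead-in above)."""
    for _ in range(n_iterations):
        for (s, a, r, s_next, done) in batch:
            clipped_q_update(qa, qb, s, a, r, s_next, done, **kw)
        qa, qb = qb, qa                                      # alternate which table is updated
    return qa, qb


# Tiny usage example on a toy 3-state, 2-action problem.
qa, qb = np.zeros((3, 2)), np.zeros((3, 2))
batch = [(0, 1, 1.0, 2, True), (0, 0, 0.0, 1, False)]
qa, qb = abcdqn_critic_update(qa, qb, batch, n_iterations=2)
```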
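A similarly hedged sketch of the actor update follows: for each critic, the actor's action probabilities are moved a small step `lam` toward that critic's greedy policy, in the spirit of Conservative Policy Iteration. The tabular policy, the step size `lam`, and the use of each critic's value estimates to define its greedy action are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def actor_update(policy, critics, states, lam=0.05):
    """Move the tabular policy toward each critic's greedy policy (illustrative sketch).

    `policy`:  array of shape (n_states, n_actions); each row is a probability vector.
    `critics`: list of (qa, qb) pairs, each of shape (n_states, n_actions).
    Imitating one critic after another is what gives the Thompson-sampling-like
    exploration described in the text.
    """
    n_actions = policy.shape[1]
    for qa, qb in critics:
        q = np.minimum(qa, qb)                   # conservative value estimate (a simplification)
        for s in states:
            greedy = np.zeros(n_actions)
            greedy[int(np.argmax(q[s]))] = 1.0   # one-hot greedy policy of this critic
            policy[s] = (1.0 - lam) * policy[s] + lam * greedy  # small step toward it
    return policy


# Usage: a policy over 3 states and 2 actions, updated against two random critics.
policy = np.full((3, 2), 0.5)
critics = [(np.random.rand(3, 2), np.random.rand(3, 2)) for _ in range(2)]
policy = actor_update(policy, critics, states=range(3))
```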
Experimental Results
BDPI demonstrates superior sample efficiency compared to existing algorithms such as Bootstrapped DQN, Proximal Policy Optimization (PPO), and Actor-Critic using Kronecker-Factored Trust Region (ACKTR). The evaluation covers a diverse set of environments posing distinct challenges in state space and task dynamics, including sparse-reward scenarios and high-dimensional tasks.
The results indicate that BDPI outperforms these baselines, particularly in exploration-intensive environments. Its robustness to hyper-parameter settings underscores its versatility and ease of deployment: the combination of multiple critics with little need for per-task tuning is what sets BDPI apart in practical applications.
Implications and Speculations on Future Developments
Practically, BDPI's decoupled learning framework marks significant progress toward RL systems that need fewer interactions with the environment, an essential trait for applications where data collection is slow or expensive. Theoretically, it opens avenues for research on algorithms that combine off-policy critics with an explicitly learned policy, improving exploration without compromising stability.
Future developments could investigate BDPI's application to continuous action spaces. Preliminary results suggest that BDPI, adapted with action-discretization strategies, could match or even surpass state-of-the-art continuous-action RL methods such as Soft Actor-Critic and TD3. This extension would further broaden BDPI's utility in complex, real-world continuous-control tasks.
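To make the discretization idea concrete, here is a small, hypothetical helper (not from the paper) that maps a discrete action index onto an evenly spaced grid over a one-dimensional continuous action range, so a discrete-action agent such as BDPI could act in a continuous-control task. The bounds and bin count are arbitrary assumptions.

```python
import numpy as np

def make_discretizer(low, high, n_bins):
    """Hypothetical helper: map a discrete action index in [0, n_bins)
    to a continuous action evenly spaced over [low, high]."""
    grid = np.linspace(low, high, n_bins)
    def to_continuous(action_index):
        return float(grid[int(action_index)])
    return to_continuous

# Example: 9 discrete torques covering the range [-2.0, 2.0].
to_torque = make_discretizer(-2.0, 2.0, n_bins=9)
print(to_torque(0), to_torque(4), to_torque(8))   # -2.0 0.0 2.0
```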
To conclude, Bootstrapped Dual Policy Iteration is a significant contribution to reinforcement learning, particularly for its sample efficiency and for the practicality of deploying RL across varied environments without laborious parameter tuning. Continued work on action discretization and on applying BDPI to more complex domains is likely to yield valuable insights toward more robust RL frameworks.