Doubly Robust Off-policy Value Evaluation for Reinforcement Learning (1511.03722v3)

Published 11 Nov 2015 in cs.LG, cs.AI, cs.SY, stat.ME, and stat.ML

Abstract: We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL in real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators. We demonstrate the estimator's accuracy in several benchmark problems, and illustrate its use as a subroutine in safe policy improvement. We also provide theoretical results on the hardness of the problem, and show that our estimator can match the lower bound in certain scenarios.

Citations (589)

Summary

  • The paper shows that extending the Doubly Robust estimator to sequential decision problems yields unbiased value estimates with substantially lower variance than importance sampling.
  • It combines regression-based and importance sampling techniques to effectively manage distribution mismatches between behavior and target policies.
  • Empirical results on benchmarks like Mountain Car and the KDD Cup 1998 Donation dataset confirm its superior performance over traditional methods.

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

The paper addresses the challenge of off-policy value evaluation in reinforcement learning (RL), a pertinent issue when estimating the value of a policy using data generated from a different policy. This scenario is common in practical RL applications where direct policy evaluation through deployment can be infeasible due to associated risks or costs.

Traditional approaches either model the Markov Decision Process (MDP) via regression or use Importance Sampling (IS) to correct for the distribution shift between the behavior and target policies. However, regression-based methods can carry uncontrolled bias when the model is misspecified, while IS-based methods suffer from variance that grows rapidly with the horizon. The proposed method extends the Doubly Robust (DR) estimator, originally developed for contextual bandits, to sequential decision problems, aiming to get the best of both.
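For concreteness, the standard trajectory-wise IS and WIS baselines (textbook definitions, not code from the paper) can be sketched as follows; `trajectories`, `pi_e`, and `pi_b` are illustrative names for logged episodes and the target/behavior policy probabilities.

```python
import numpy as np

def is_and_wis_estimates(trajectories, pi_e, pi_b, gamma=1.0):
    """Trajectory-wise importance sampling (IS) and weighted IS (WIS).

    trajectories: list of episodes, each a list of (s, a, r) collected under pi_b.
    pi_e(a, s), pi_b(a, s): action probabilities of the target/behavior policies.
    """
    weights, returns = [], []
    for traj in trajectories:
        rho, g, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            rho *= pi_e(a, s) / pi_b(a, s)   # cumulative importance weight
            g += discount * r                # discounted return of the episode
            discount *= gamma
        weights.append(rho)
        returns.append(g)
    weights, returns = np.array(weights), np.array(returns)
    v_is = np.mean(weights * returns)                    # unbiased, high variance
    v_wis = np.sum(weights * returns) / np.sum(weights)  # biased, lower variance
    return v_is, v_wis
```

IS is unbiased but its weights are products over the horizon, so its variance can blow up on long episodes; WIS trades a small bias for lower variance.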

Methodology

The DR estimator is designed to handle uncertainty in both the modeling and sampling processes, combining the advantages of the regression-based approach (low variance) with those of the IS-based approach (no bias). The authors propose a simple recursive form of the estimator that addresses the distributional mismatch step by step along each trajectory, as sketched below.
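The step-wise doubly robust recursion, as it is usually written, applies the importance ratio only to the one-step residual of the value model. The sketch below is a minimal illustration under that reading; `V_hat` and `Q_hat` are assumed approximate value functions for the target policy, and the other names are illustrative.

```python
import numpy as np

def dr_estimate(trajectories, pi_e, pi_b, V_hat, Q_hat, gamma=1.0):
    """Doubly robust off-policy value estimate via the step-wise recursion.

    V_hat(s), Q_hat(s, a): approximate state/action values under the target policy.
    The importance ratio rho_t multiplies only the one-step model residual,
    which keeps the estimate unbiased while the model absorbs variance.
    """
    estimates = []
    for traj in trajectories:
        v_dr = 0.0
        # Work backwards through the episode:
        # V_DR(t) = V_hat(s_t) + rho_t * (r_t + gamma * V_DR(t+1) - Q_hat(s_t, a_t))
        for s, a, r in reversed(traj):
            rho = pi_e(a, s) / pi_b(a, s)
            v_dr = V_hat(s) + rho * (r + gamma * v_dr - Q_hat(s, a))
        estimates.append(v_dr)
    return float(np.mean(estimates))
```

Setting `V_hat` and `Q_hat` to zero recovers per-decision importance sampling, while an accurate `Q_hat` shrinks the residual that the importance weight multiplies, which is where the variance reduction comes from.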

Crucially, the paper analyzes the variance of the DR estimator, demonstrating its statistical benefits over traditional methods. In certain scenarios, DR's variance can match the Cramér-Rao lower bound, meaning no unbiased estimator can do asymptotically better in those settings. In particular, when the value model is correctly specified, the DR estimator removes variance without introducing any bias.

Strong Numerical Results

Empirical validation on benchmark problems such as Mountain Car and Sailing showcases the DR estimator's superior accuracy. In these tasks, DR outperforms IS and weighted importance sampling (WIS), notably when the behavior policy diverges from the target policy. Simulated experiments on the KDD Cup 1998 Donation dataset further demonstrate its practical benefits.

Theoretical and Practical Implications

The research underscores the inherent difficulty of off-policy evaluation and lays out theoretical bounds on achievable estimation accuracy. The DR estimator offers a compelling way to reduce variance without introducing bias, especially in complex MDPs where the transition dynamics can be estimated reliably.

From a practical standpoint, the ability to perform accurate off-policy evaluation allows safer and more reliable policy improvements in real-world applications. This advantage translates to increased trust in deploying learned policies in scenarios like medical treatment plans or robotics, where decision errors can be costly.
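One illustrative way such an evaluation can gate deployment (a hypothetical pattern, not the paper's exact safe policy improvement procedure) is to accept a candidate policy only when a bootstrap lower confidence bound on its per-trajectory DR estimates exceeds the incumbent policy's value; all names below are assumptions for the sketch.

```python
import numpy as np

def safe_to_deploy(per_traj_dr, baseline_value, delta=0.05, n_boot=1000, rng=None):
    """Accept the candidate policy only if a bootstrap (1 - delta) lower
    confidence bound on its DR value estimate exceeds the baseline's value.

    per_traj_dr: per-trajectory DR estimates for the candidate policy.
    baseline_value: estimated value of the currently deployed policy.
    """
    rng = np.random.default_rng() if rng is None else rng
    per_traj_dr = np.asarray(per_traj_dr)
    boot_means = [rng.choice(per_traj_dr, size=per_traj_dr.size, replace=True).mean()
                  for _ in range(n_boot)]
    lower_bound = np.quantile(boot_means, delta)  # lower tail of the bootstrap means
    return lower_bound > baseline_value
```

The lower the variance of the underlying estimator, the tighter this bound, which is why a variance-reduced estimate like DR makes such conservative deployment tests far less likely to reject a genuinely better policy.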

Future Directions

The DR estimator's application to broader RL problems and its integration with modern RL algorithms, such as deep reinforcement learning, could provide further advancements in the field. Exploration into how DR can be practically integrated into existing RL workflows could enhance the robustness of decision-making systems in data-scarce environments.

In conclusion, this work provides a substantial contribution to the field of off-policy RL by introducing a method that effectively balances bias and variance, positioning it as a robust choice for policy evaluation and improvement in both academic and industrial settings.