- The paper introduces novel off-policy evaluation estimators that are intrinsically efficient: their asymptotic MSE is guaranteed to be no larger than that of IS, SNIS, and DR.
- It leverages REG and EMP techniques to optimize parameters via regression and empirical likelihood, reducing asymptotic MSE.
- Experimental results confirm improved stability, boundedness, and performance in reinforcement learning and contextual bandit settings.
Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning
Introduction and Background
The paper "Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning" (1906.03735) tackles key challenges in off-policy evaluation (OPE), a problem central to both reinforcement learning (RL) and contextual bandits (CB). OPE involves assessing a policy's value using data generated by a different policy, typically without exploration, which is costly or impractical in many applications such as healthcare and education. Traditional methods like importance sampling (IS) and its variants, self-normalized IS (SNIS) and doubly robust (DR) estimators, while effective, have limitations in efficiency, stability, and boundedness. The authors propose new OPE estimators based on empirical likelihood to address these issues, ensuring efficiency superior to IS, SNIS, and DR while maintaining stability and boundedness.
Existing Methods and Their Limitations
Current OPE techniques fall into three major categories: the direct method (DM), importance sampling (IS), and the doubly robust (DR) method. DM estimates the Q-function directly through regression; it is locally efficient when the model is correctly specified but inconsistent when it is misspecified. IS is unbiased but suffers from high variance, which motivates SNIS: by normalizing with the sum of the importance weights, SNIS gains boundedness and stability, properties that matter when density ratios vary widely due to poor overlap. DR combines DM and IS and achieves local efficiency when the Q-function is correctly specified, but it can perform worse than IS and SNIS when the Q-function is misspecified, and it lacks SNIS's inherent stability and boundedness.
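As a concrete reference point, here is a minimal NumPy sketch of these four baselines in the contextual-bandit setting. The function and argument names (`pi_e`, `pi_b`, `q_hat`) are illustrative choices made for this summary, not the paper's notation.

```python
import numpy as np

def baseline_estimators(x, a, r, pi_e, pi_b, q_hat, actions):
    """DM, IS, SNIS, and DR estimates from logged bandit data (x_i, a_i, r_i).

    pi_e(b, x), pi_b(b, x): action probabilities under the evaluation
    and behavior policies; q_hat(x, b): a fitted reward model.
    """
    n = len(r)
    r = np.asarray(r, dtype=float)
    w = np.array([pi_e(a[i], x[i]) / pi_b(a[i], x[i]) for i in range(n)])
    q_logged = np.array([q_hat(x[i], a[i]) for i in range(n)])

    # Direct method: plug the fitted reward model into the evaluation policy.
    dm = np.mean([sum(pi_e(b, x[i]) * q_hat(x[i], b) for b in actions)
                  for i in range(n)])
    # Importance sampling: unbiased, but variance blows up with large weights.
    is_est = np.mean(w * r)
    # Self-normalized IS: bounded by the observed reward range, more stable.
    snis = np.sum(w * r) / np.sum(w)
    # Doubly robust: DM plus an importance-weighted residual correction.
    dr = dm + np.mean(w * (r - q_logged))
    return dm, is_est, snis, dr
```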
Proposed Estimators
The proposed estimators aim to provide intrinsic efficiency, stability, and boundedness. Intrinsic efficiency means that the asymptotic mean squared error (MSE) is no larger than that of IS, SNIS, and DR, regardless of whether the Q-function model is correctly specified. The paper introduces the REG and EMP methods, which work within a parameterized class of estimators containing IS, SNIS, and DR as special cases and choose the parameter by regression (REG) or by empirical likelihood (EMP). These choices reduce the asymptotic MSE while preserving boundedness and stability.
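One way to read REG is through the lens of control variates: the terms w_i - 1 and w_i q_hat(x_i, a_i) - E_{a ~ pi_e}[q_hat(x_i, a)] have mean zero under the behavior policy, fixed coefficients on them recover IS, SNIS, or DR, and regressing the IS terms w_i r_i on them selects the variance-minimizing coefficients. The sketch below illustrates this interpretation in the bandit case; it is a simplification under these assumptions, not the paper's exact construction.

```python
import numpy as np

def reg_style_estimate(w, r, q_logged, q_pi_e):
    """Control-variate sketch of a REG-style estimator (illustrative only).

    w:        importance weights pi_e / pi_b at the logged actions
    r:        observed rewards
    q_logged: q_hat(x_i, a_i) at the logged actions
    q_pi_e:   E_{a ~ pi_e}[q_hat(x_i, a)] for each context
    """
    y = w * r                       # the IS terms
    c1 = w - 1.0                    # mean-zero control variate (normalization direction)
    c2 = w * q_logged - q_pi_e      # mean-zero control variate (DR direction)
    C = np.column_stack([c1, c2])

    # Fixed coefficients recover the baselines: beta = (0, 0) gives IS,
    # beta = (0, 1) gives DR, and beta = (SNIS estimate, 0) gives SNIS.
    # Least squares on the centered variables picks the coefficients that
    # minimize the estimated variance of y - C @ beta.
    beta, *_ = np.linalg.lstsq(C - C.mean(axis=0), y - y.mean(), rcond=None)
    return np.mean(y - C @ beta)
```

Because the fitted coefficients minimize the estimated variance over a class that contains the baselines, the resulting estimate cannot be asymptotically worse than any fixed member of that class, which is the intuition behind intrinsic efficiency.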
Markov Decision Processes and OPE Framework
For RL, the Markov decision process (MDP) is the foundational framework, defined by states, actions, rewards, transition probabilities, and a policy specifying a distribution over actions at each state. The OPE problem is to estimate the expected (discounted) cumulative reward of an evaluation policy using trajectories generated by a behavior policy. The paper extends its estimators to this sequential setting by optimizing control variates within each trajectory, building on traditional trajectory-level IS and DR estimators so that the resulting estimators retain the desired statistical properties.
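For orientation, the standard trajectory-level baselines that the sequential estimators build on can be sketched as follows: per-decision IS and the conventional step-wise DR estimator. The array names and the (trajectories x horizon) layout are assumptions made for illustration; this reproduces the standard baselines, not the paper's new estimators.

```python
import numpy as np

def per_decision_is(rho, rewards, gamma):
    """Per-decision importance sampling over trajectories.

    rho:     (n, T) per-step weights pi_e(a_t|s_t) / pi_b(a_t|s_t)
    rewards: (n, T) observed rewards
    """
    n, T = rewards.shape
    cum_rho = np.cumprod(rho, axis=1)            # rho_{0:t} per trajectory
    discounts = gamma ** np.arange(T)
    return np.mean(np.sum(discounts * cum_rho * rewards, axis=1))

def step_wise_dr(rho, rewards, q_hat, v_hat, gamma):
    """Standard step-wise doubly robust estimator for RL.

    q_hat: (n, T) fitted Q-values at the logged (s_t, a_t)
    v_hat: (n, T) E_{a ~ pi_e}[q_hat(s_t, a)] at each visited state
    """
    n, T = rewards.shape
    cum_rho = np.cumprod(rho, axis=1)                          # rho_{0:t}
    prev_rho = np.hstack([np.ones((n, 1)), cum_rho[:, :-1]])   # rho_{0:t-1}
    discounts = gamma ** np.arange(T)
    terms = cum_rho * (rewards - q_hat) + prev_rho * v_hat
    return np.mean(np.sum(discounts * terms, axis=1))
```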
Theoretical Guarantees and Properties
The paper provides theoretical guarantees showing that the proposed estimators satisfy the desired criteria: local and intrinsic efficiency, α-boundedness, and stability. These properties are established through derived asymptotic MSE formulas, which show that the proposed estimators are never asymptotically worse than the traditional methods. REG achieves this by minimizing an estimate of the asymptotic variance, while EMP uses empirical likelihood to select parameters that satisfy the efficiency and boundedness requirements.
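To give a flavor of the EMP side, the following sketch computes classical empirical-likelihood weights in the bandit case: maximize the sum of log p_i subject to the weights summing to one and the known moment condition that the weighted mean of (w_i - 1) is zero, then reweight the IS terms. Additional moment constraints (for example, ones involving the estimated Q-function) are omitted, so this is a simplified illustration of the empirical-likelihood idea rather than the paper's estimator.

```python
import numpy as np

def emp_style_estimate(w, r, n_iter=50):
    """Empirical-likelihood reweighting sketch (illustrative only).

    Finds weights p_i maximizing sum_i log p_i subject to sum_i p_i = 1
    and sum_i p_i * (w_i - 1) = 0, then returns sum_i p_i * w_i * r_i.
    """
    g = w - 1.0  # moment with known expectation zero under the behavior policy
    # Dual solution: p_i = 1 / (n * (1 + lam * g_i)), where lam solves
    # sum_i g_i / (1 + lam * g_i) = 0.  Bisect over the interval that keeps
    # every denominator 1 + lam * g_i positive.
    lo = -1.0 / g.max() + 1e-8 if g.max() > 0 else -1e6
    hi = -1.0 / g.min() - 1e-8 if g.min() < 0 else 1e6
    f = lambda lam: np.sum(g / (1.0 + lam * g))
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    p = 1.0 / (len(w) * (1.0 + lam * g))
    return np.sum(p * w * r)
```

Because the weights are constrained to be a proper probability distribution, the resulting estimate stays within the observed reward range, which is the boundedness property that plain IS and DR lack.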
Experimental Evaluation
Experiments assess the practical performance of REG and EMP in both CB and RL settings across several datasets and environments. The results show lower RMSE than existing methods, with the proposed estimators consistently outperforming the alternatives across scenarios and proving particularly effective when model specifications are uncertain or misspecified.
Conclusion
The paper suggests that robust modifications to existing OPE techniques can lead to substantial practical and theoretical advancements in evaluating policies in RL and CB. By ensuring intrinsic efficiency, stability, and boundedness, the proposed methods provide a reliable framework for policy evaluation without the instability associated with traditional approaches. Future research may consider enhancements through hybrid estimators or additional constraints to further refine efficiency and stability in OPE tasks.