- The paper introduces a novel off-policy estimator that applies importance sampling directly to stationary state-visitation distributions, reducing variance in infinite-horizon scenarios.
- The paper estimates the required density ratio via a mini-max loss formulation; choosing the discriminator class to be the unit ball of an RKHS yields a closed-form inner maximization, and the approach comes with theoretical guarantees, including unbiased reward estimation given the true ratio.
- The paper demonstrates through extensive empirical analyses that the proposed method outperforms traditional IS techniques, offering practical benefits in risk-sensitive applications.
Insights into Infinite-Horizon Off-Policy Estimation
The paper "Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation" tackles the intricate problem of estimating the expected reward of a target policy using off-policy samples in reinforcement learning (RL), particularly in the context of infinite-horizon scenarios. The authors address the limitations of traditional importance sampling (IS) methods, which suffer from high variance, especially in long-horizon or infinite-horizon problems where the variance can become unbounded.
Contribution and Key Innovations
The core contribution of the paper is the development of a novel off-policy estimator that applies IS directly to the stationary state-visitation distributions. This sidesteps the exploding variance of existing methods, which reweight each trajectory by a cumulative product of per-step importance ratios, thus alleviating what the authors term the "curse of horizon."
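Concretely, the estimator can be written with a single state-wise ratio in place of the cumulative product; the self-normalized form below is a standard rendering of the idea rather than a verbatim quote from the paper:

```latex
% w(s) = d_{\pi}(s) / d_{\pi_0}(s): ratio of stationary state distributions.
% Samples (s_i, a_i, r_i) are collected under the behavior policy \pi_0.
\hat{R}_{\pi}
  = \frac{\sum_{i=1}^{n} w(s_i)\, \frac{\pi(a_i \mid s_i)}{\pi_0(a_i \mid s_i)}\, r_i}
         {\sum_{i=1}^{n} w(s_i)\, \frac{\pi(a_i \mid s_i)}{\pi_0(a_i \mid s_i)}}.
```

Because every weight involves only a single state-action pair, the horizon never enters the weights.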
The authors introduce an approach to estimate the density ratio of two stationary state distributions using samples drawn solely from the behavior policy. This involves formulating a mini-max loss for the estimation problem and deriving a closed-form solution for its inner maximization when the discriminator class is taken to be the unit ball of a reproducing kernel Hilbert space (RKHS).
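Below is a minimal sketch of how such a kernel loss could be minimized on a small discrete state space. It is illustrative only, not the authors' implementation: the tabular parameterization of w, the RBF kernel, and the plain gradient-descent loop are assumptions made for this example. The residual being driven toward zero for each observed transition is w(s) * pi(a|s)/pi0(a|s) - w(s'), paired with itself through the kernel Gram matrix on next states.

```python
import numpy as np

def estimate_state_ratio(transitions, n_states, kernel, lr=0.1, iters=2000):
    """Sketch: estimate w(s) ~ d_pi(s) / d_pi0(s) from behavior-policy data.

    transitions: list of (s, s_next, beta) with beta = pi(a|s) / pi0(a|s).
    Assumes a small discrete state space so w can be stored as a table.
    """
    s = np.array([t[0] for t in transitions])
    s_next = np.array([t[1] for t in transitions])
    beta = np.array([t[2] for t in transitions])
    n = len(transitions)

    K = kernel(s_next[:, None], s_next[None, :])   # Gram matrix on next states
    log_w = np.zeros(n_states)                     # optimize in log space so w > 0

    for _ in range(iters):
        w = np.exp(log_w)
        delta = w[s] * beta - w[s_next]            # residual for each transition
        # Empirical loss D(w) = (1/n^2) * delta^T K delta; the maximization over
        # the RKHS discriminator f is already folded in via the kernel.
        g_delta = (2.0 / n**2) * (K @ delta)
        g_logw = np.zeros(n_states)
        np.add.at(g_logw, s, g_delta * beta * w[s])        # chain rule through w(s)
        np.add.at(g_logw, s_next, -g_delta * w[s_next])    # chain rule through w(s')
        log_w -= lr * g_logw

    w = np.exp(log_w)
    return w / w[s].mean()   # rescale so w averages to ~1 over the behavior data

# Illustrative kernel choice: RBF on integer-coded states.
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
```

Once w has been estimated, it plugs into the self-normalized estimator shown earlier. The paper's general treatment also handles discounted rewards and continuous state spaces, which this toy sketch does not attempt.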
Theoretical and Empirical Analysis
The paper provides both theoretical and empirical support for the proposed method. Theoretically, the authors show that their approach yields an unbiased estimator of the expected reward under certain conditions, and that the estimator's variance does not grow with the trajectory length, in contrast to trajectory-wise IS, whose variance can grow exponentially with the horizon.
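The heart of the unbiasedness argument (stated informally here for the unnormalized variant, assuming the true ratio w is available) is a one-line change of measure:

```latex
% With w(s) = d_{\pi}(s) / d_{\pi_0}(s), reweighting moves the expectation
% from the behavior distribution to the target distribution.
\mathbb{E}_{(s,a) \sim d_{\pi_0}}\!\left[
    w(s)\, \frac{\pi(a \mid s)}{\pi_0(a \mid s)}\, r(s,a)
\right]
= \mathbb{E}_{(s,a) \sim d_{\pi}}\bigl[ r(s,a) \bigr]
= R_{\pi}.
```

Since no product over time steps appears in the weight, the variance bound is horizon-free.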
In the empirical analysis, the authors demonstrate the effectiveness of their method on a variety of environments, comparing against traditional IS and weighted importance sampling (WIS) baselines. The results show consistent improvements in estimation accuracy, particularly in long-horizon settings.
Numerical Results and Validation
The paper presents strong numerical results, showing a significant reduction in variance and improved accuracy over trajectory-wise and step-wise IS methods. These improvements are consistently observed across different experimental settings.
Importantly, the paper addresses practical settings in which off-policy data is available but running the target policy directly is infeasible due to cost or risk. The method is particularly relevant in domains such as medical treatment optimization or safety-critical systems, where extensive exploration under the target policy could be costly or dangerous.
Implications and Future Directions
The findings of this research have significant implications for the field of artificial intelligence, particularly in the development of more efficient RL algorithms that can leverage off-policy data in complex environments. The proposed estimator can be employed as a key component in off-policy policy optimization algorithms, potentially pushing the boundary of current RL applications.
For future work, it would be interesting to explore the scalability of this approach to larger state and action spaces or apply it to value function estimation and policy optimization. Additionally, extending theoretical guarantees and exploring connections with other variance-reduction techniques could further strengthen the impact of this work.
The methodology presented in this paper provides a promising direction in overcoming the challenges inherent in infinite-horizon problems, offering a robust framework for reliable off-policy estimation in RL.