Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy (2404.01830v1)

Published 2 Apr 2024 in stat.ML and cs.LG

Abstract: We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown. The proposed estimator first estimates the logging policy and then estimates the value function model by minimizing the asymptotic variance of the estimator, while accounting for the effect of having estimated the logging policy. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, DRUnknown is optimal, as its asymptotic variance reaches the semiparametric lower bound. We present experimental results in contextual bandits and reinforcement learning comparing the performance of DRUnknown with that of existing methods.
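To make the estimator's structure concrete, below is a minimal sketch of a doubly-robust value estimate for the contextual-bandit case, where the logging policy is itself estimated from the logged data. This is the generic DR construction, not the paper's DRUnknown procedure: all function names and signatures are illustrative placeholders, and the paper's distinctive step of fitting the value model to minimize the estimator's asymptotic variance is not reproduced here.

```python
import numpy as np

def dr_value_estimate(contexts, actions, rewards, pi_e, pi_b_hat, q_hat, n_actions):
    """Doubly-robust off-policy value estimate from logged bandit data.

    contexts, actions, rewards: logged data sequences of length n.
    pi_e(x, a):     target-policy probability of action a in context x.
    pi_b_hat(x, a): *estimated* logging-policy probability (the true
                    logging policy is assumed unknown).
    q_hat(x, a):    fitted reward (value) model.
    All callables are hypothetical placeholders, not the paper's API.
    """
    n = len(rewards)
    dm, corr = np.empty(n), np.empty(n)
    for i, (x, a, r) in enumerate(zip(contexts, actions, rewards)):
        # Direct-method term: model reward averaged over the target policy.
        dm[i] = sum(pi_e(x, k) * q_hat(x, k) for k in range(n_actions))
        # Importance weight computed with the estimated logging policy.
        rho = pi_e(x, a) / pi_b_hat(x, a)
        # DR correction: importance-weighted residual of the reward model.
        corr[i] = rho * (r - q_hat(x, a))
    return float(np.mean(dm + corr))
```

As in standard DR analysis, this estimate is consistent if either the logging-policy model or the reward model is correctly specified; DRUnknown additionally chooses the value model to minimize asymptotic variance given that the logging policy was estimated rather than known.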

