Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data (2306.14063v2)
Abstract: Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Currently, most results hinge on unrealistic assumptions about the data distribution -- namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We consider a more general setting where the dataset may have been gathered adaptively. We develop theory for the TMIS (Tabular Marginalized Importance Sampling) Offline Policy Evaluation (OPE) estimator in this generalized setting for tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error. We also recover minimax-optimal offline learning guarantees in the adaptive setting. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive regimes.
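For readers unfamiliar with the estimator named in the abstract, the following is a minimal, hypothetical sketch of a tabular, model-based OPE computation in the spirit of marginalized importance sampling. It is not the paper's exact TMIS construction; the function name, data layout, and the handling of unvisited state-action pairs are assumptions made purely for illustration.

```python
import numpy as np

def tmis_ope(data, pi, S, A, H):
    """Hypothetical sketch of a tabular, model-based OPE estimate in the spirit
    of TMIS: estimate transitions/rewards from logged episodes, roll the target
    policy's state occupancy forward, and accumulate expected reward.
    `data` is a list of episodes, each a length-H list of (s, a, r, s_next)
    tuples; `pi[h]` is an (S, A) array of action probabilities for the target
    policy at step h."""
    n = len(data)
    d = np.zeros(S)                          # occupancy estimate d_h(s) under pi
    for ep in data:                          # empirical initial-state distribution
        d[ep[0][0]] += 1.0 / n
    v_hat = 0.0
    for h in range(H):
        cnt_sa = np.zeros((S, A))            # visit counts n_h(s, a)
        cnt_sas = np.zeros((S, A, S))        # transition counts n_h(s, a, s')
        rew_sum = np.zeros((S, A))           # summed rewards observed at (s, a)
        for ep in data:
            s, a, r, s_next = ep[h]
            cnt_sa[s, a] += 1
            cnt_sas[s, a, s_next] += 1
            rew_sum[s, a] += r
        denom = np.maximum(cnt_sa, 1)        # avoid 0/0; unvisited pairs contribute 0
        P_hat = cnt_sas / denom[..., None]   # empirical transition model
        r_hat = rew_sum / denom              # empirical mean reward
        occ_sa = d[:, None] * pi[h]          # d_h(s) * pi_h(a | s)
        v_hat += (occ_sa * r_hat).sum()      # expected reward collected at step h
        d = np.einsum("sa,sap->p", occ_sa, P_hat)  # roll occupancy forward to d_{h+1}
    return v_hat
```

The sketch assumes a fixed horizon H and fully observed tabular episodes; the paper's analysis concerns how such estimates behave when the logging data is collected adaptively rather than i.i.d.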
Authors: Sunil Madhow, Dan Qiao, Ming Yin, Yu-Xiang Wang