Statistical Efficiency of Distributional Temporal Difference Learning (2403.05811v3)
Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ of a given policy $\pi$. Distributional temporal difference (TD) learning has accordingly been proposed as an extension of classic TD learning. In the tabular case, Rowland et al. (2018) and Rowland et al. (2023) proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD learning (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD needs $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest. In addition, we revisit CTD and show that the same non-asymptotic convergence bounds hold for CTD under the $p$-Wasserstein distance for $p\geq 1$.
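To make the rate concrete, here is a small worked instantiation. The $p$-Wasserstein distance is taken to be its standard coupling definition (an assumption, since the abstract does not restate it), and the second display simply plugs $p=1$, the case for which the bound is claimed to be minimax optimal up to logarithmic factors, into the stated rate:

$$
W_p(\eta_1,\eta_2)=\Big(\inf_{\kappa\in\Gamma(\eta_1,\eta_2)}\int_{\mathbb{R}\times\mathbb{R}}|x-y|^p\,\kappa(\mathrm{d}x,\mathrm{d}y)\Big)^{1/p},
\qquad
\tilde{O}\!\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)\bigg|_{p=1}=\tilde{O}\!\left(\frac{1}{\varepsilon^{2}(1-\gamma)^{3}}\right),
$$

where $\Gamma(\eta_1,\eta_2)$ denotes the set of couplings of $\eta_1$ and $\eta_2$.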
- M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
- M. G. Bellemare, W. Dabney, and M. Rowland. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
- M. Böck and C. Heitzinger. Speedy categorical distributional reinforcement learning and complexity analysis. SIAM Journal on Mathematics of Data Science, 4(2):675–693, 2022. doi: 10.1137/20M1364436. URL https://doi.org/10.1137/20M1364436.
- Superhuman performance on sepsis MIMIC-III data by distributional reinforcement learning. PLoS One, 17(11):e0275358, 2022.
- V. I. Bogachev. Measure theory, volume 1. Springer, 2007.
- W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- M. Gheshlaghi Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. Advances in Neural Information Processing Systems, 24, 2011.
- M. Gheshlaghi Azar, R. Munos, and H. J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91:325–349, 2013.
- E. Ghysels, P. Santa-Clara, and R. Valkanov. There is a risk-return trade-off after all. Journal of Financial Economics, 76(3):509–548, 2005.
- S. M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
- M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49:193–208, 2002.
- P. W. Lavori and R. Dawson. Dynamic treatment regimes: practical design considerations. Clinical Trials, 1(1):9–20, 2004.
- G. Li, C. Cai, Y. Chen, Y. Wei, and Y. Chi. Is Q-learning minimax optimal? A tight sample complexity analysis. Operations Research, 72(1):222–236, 2024.
- S. Luo. On Azuma-type inequalities for Banach space-valued martingales. Journal of Theoretical Probability, 35(2):772–800, Jun 2022. ISSN 1572-9230. doi: 10.1007/s10959-021-01086-5. URL https://doi.org/10.1007/s10959-021-01086-5.
- T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 799–806, 2010.
- G. Pisier. Martingales in Banach Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2016. doi: 10.1017/CBO9781316480588.
- H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- N. Ross. Fundamentals of Stein’s method. Probability Surveys, 8:210–293, 2011.
- M. Rowland, M. G. Bellemare, W. Dabney, R. Munos, and Y. W. Teh. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 29–37. PMLR, 2018.
- M. Rowland, R. Munos, M. G. Azar, Y. Tang, G. Ostrovski, A. Harutyunyan, K. Tuyls, M. G. Bellemare, and W. Dabney. An analysis of quantile temporal-difference learning. arXiv preprint arXiv:2301.04462, 2023.
- Near-minimax-optimal distributional reinforcement learning with a generative model. arXiv preprint arXiv:2402.07598, 2024.
- R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
- R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. doi: 10.1017/9781108231596.
- R. Wu, M. Uehara, and W. Sun. Distributional offline policy evaluation with predictive error guarantees. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 37685–37712. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/wu23s.html.
- L. Zhang, Y. Peng, J. Liang, W. Yang, and Z. Zhang. Estimation and inference in distributional reinforcement learning. arXiv preprint arXiv:2309.17262, 2023.
- Yang Peng
- Liangyu Zhang
- Zhihua Zhang