An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks (2405.04017v1)
Abstract: Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and a new $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity under Markovian sampling, as opposed to the best-known $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity in the existing literature.
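To make the setting concrete, the sketch below shows one plausible form of the neural TD(0) update the abstract refers to: a semi-gradient step on a deep ReLU value network, driven by a single Markovian trajectory. The environment interface (`env.reset`/`env.step`, assumed to follow the fixed policy being evaluated), the network width and depth, and the step size are all illustrative assumptions, not the exact algorithm or constants analyzed in the paper.

```python
# A minimal sketch of neural TD(0) policy evaluation, assuming a hypothetical
# environment whose step() follows the fixed policy being evaluated and
# returns (next_state, reward, done). All hyperparameters are placeholders.
import torch
import torch.nn as nn


def make_value_net(state_dim: int, width: int = 256, depth: int = 3) -> nn.Sequential:
    """An L-layer ReLU network V_theta : S -> R (depth plays the role of L)."""
    layers, in_dim = [], state_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    return nn.Sequential(*layers)


def neural_td0(env, state_dim: int, gamma: float = 0.99,
               lr: float = 1e-3, num_steps: int = 10_000) -> nn.Sequential:
    net = make_value_net(state_dim)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    for _ in range(num_steps):
        # Markovian sampling: consecutive, correlated states from one trajectory,
        # not i.i.d. draws from the stationary distribution.
        s_next, r, done = env.step()
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        with torch.no_grad():  # semi-gradient: no gradient through the bootstrap target
            target = r + (0.0 if done else gamma * net(s_next).item())
        delta = target - net(s)          # TD error delta_t
        loss = 0.5 * delta.pow(2).sum()  # SGD step realizes theta += lr * delta * grad V(s)
        opt.zero_grad()
        loss.backward()
        opt.step()
        s = torch.as_tensor(env.reset(), dtype=torch.float32) if done else s_next
    return net
```

The `torch.no_grad()` block is what makes this a semi-gradient (bootstrapped) update rather than a full residual-gradient step, matching the standard TD scheme the analysis concerns.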