Temporal Difference Learning with Experience Replay (2306.09746v1)

Published 16 Jun 2023 in cs.LG and cs.AI

Abstract: Temporal-difference (TD) learning is widely regarded as one of the most popular algorithms in reinforcement learning (RL). Despite its widespread use, only recently have researchers begun to actively study its finite-time behavior, including finite-time bounds on the mean squared error and sample complexity. On the empirical side, experience replay has been a key ingredient in the success of deep RL algorithms, but its theoretical effects on RL have yet to be fully understood. In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for TD-learning with experience replay. Specifically, under the Markovian observation model, we demonstrate that, for both the averaged-iterate and final-iterate cases, the error term induced by a constant step-size can be effectively controlled by the size of the replay buffer and of the mini-batch sampled from it.
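
The sketch below illustrates the setting the abstract describes: TD(0) with linear function approximation, a constant step-size, mini-batches drawn uniformly from an experience replay buffer, and both the final and the Polyak-averaged iterates tracked. It is a minimal illustration under assumed choices (toy random-walk environment, feature map, buffer and batch sizes, variable names such as `phi` and `theta_avg`); it is not the paper's algorithm or analysis.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

n_states, d = 10, 4                        # number of states, feature dimension
phi = rng.standard_normal((n_states, d))   # fixed feature map phi(s) (assumed)
gamma, alpha = 0.95, 0.05                  # discount factor, constant step-size
buffer_size, batch_size = 500, 32          # replay buffer and mini-batch sizes

buffer = deque(maxlen=buffer_size)         # experience replay buffer (FIFO)
theta = np.zeros(d)                        # TD weight vector (final iterate)
theta_avg, t = np.zeros(d), 0              # running average of iterates

def step(s):
    """Toy Markovian transition: random walk with a noisy reward (assumed)."""
    s_next = (s + rng.integers(-1, 2)) % n_states
    r = float(rng.normal(loc=s_next / n_states))
    return r, s_next

s = 0
for _ in range(5000):
    # Collect one Markovian transition and store it in the replay buffer.
    r, s_next = step(s)
    buffer.append((s, r, s_next))
    s = s_next

    if len(buffer) < batch_size:
        continue

    # Sample a mini-batch uniformly from the buffer and average the
    # TD(0) semi-gradients over the batch.
    idx = rng.choice(len(buffer), size=batch_size, replace=False)
    g = np.zeros(d)
    for i in idx:
        si, ri, sni = buffer[i]
        delta = ri + gamma * phi[sni] @ theta - phi[si] @ theta
        g += delta * phi[si]
    theta += alpha * g / batch_size        # constant step-size update

    # Track both the final iterate (theta) and the averaged iterate.
    t += 1
    theta_avg += (theta - theta_avg) / t

print("final iterate:", theta)
print("averaged iterate:", theta_avg)
```

In this sketch, enlarging `buffer_size` and `batch_size` reduces the variance of the mini-batch TD update, which mirrors the abstract's claim that the constant step-size error term can be controlled by the replay buffer and mini-batch sizes.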
